ABSTRACT 


There are many variables that contribute to the explanation of why a person 
enlists in the Army. To manage personnel policy efficiently with regard to the 
recruitment process, the impact and significance of these variables need to be fully 
understood. Ordinary least squares regression analysis is a powerful and useful tool in 
helping to explain the interaction of these variables. An understanding of the theories 
and methods behind this approach is essential. Army analysts apply regression-derived 
results every day in a myriad of situations and operational contexts. Misuse or 
misunderstanding of these results can lead to inaccurate recommendations to the 
decision maker. 

The thesis develops the framework for a parsimonious linear statistical model of 
quality enlistment contracts for the U.S. Army. There is a need for such a model that 
can be utilized by USAREC and DCSPER analysts to perform quick response analysis 
to ‘what if’ questions. 

In order to facilitate further model enhancement and use, it is developed in a 
step-by-step fashion. The author uses a ‘walk through’ approach and thoroughly 
discusses the assumptions, procedures and analytical tools that were utilized in the 
model development. This approach was specifically requested by the Army analysts at 
USAREC. 
I. INTRODUCTION 


The Commander, United States Army Recruiting Command (USAREC), is 
responsible for developing and issuing policies, procedures and standards for the 
recruitment of personnel into the United States Army. Each year, the Deputy Chief of 
Staff of Personnel (DCSPER) generates an accession mission based on the number of 
attritions and changes to the overall endstrength. This mission is then given to the 
Commander, USAREC. It is changed and updated throughout the year as policy 
decisions and fiscal and Congressional constraints dictate. This accession mission is 
broken down into several different categories relating to types (male, female, prior 
service, non-prior service) and quality (high school graduate, non-high school graduate, 
mental category I, II, IIIA, IIIB, IV, V). Historically, the largest problem in attaining 
these requirements has been in the enlistment of male, high school graduate, non-prior 
service, mental category I-IIIA (GSM I-IIIA) recruits. In this study, the problem of 
attempting to predict the number of these quality male recruits for future years is 
modelled. Ordinary least squares multiple linear regression analysis and stepwise 
regression analysis are utilized with a historical data base provided by USAREC. 


A. PROBLEM STATEMENT 

There are several objectives of this thesis. They vary in both scope and 
magnitude. 

First and foremost is the near term need for the development of a predictive 
model to be used by the active duty Army ‘green suit’ analysts (hereafter referred to as 
Army analysts) stationed at USAREC headquarters and at the DCSPER, Department 
of the Army. At these agencies, major policy decisions are routinely contemplated. 
These decisions are usually concerned with aggregate responses to possible major 
personnel policy changes and/or budgetary realignments. There is a need for a quick 
response mechanism to answer various ‘what if’ questions concerning the quality of the 
force. 

In this regard, it is desired to build a model that can be easily understood and 
quickly updated. Although a sufficient degree of complexity is an inherent desired 
feature of any proposed model, the true value of this particular model may lie more in 
its ability to be maintained and updated, and in its ability to be understood by the 
continuously changing population of Army analysts who are stationed for a tour of duty at these 
agencies. The Army has initiated many studies in this field (usually through 
contracting) with various results. Where applicable, these studies will be referenced in 
the body of this thesis. There is an inherent problem, however, in the Army's ability to 
keep up with these efforts, either in the updating of the data base or in the level of 
understanding of the current, on-line Army analysts assigned to USAREC and 
DCSPER. It is thought by many that an in-house model, easily updated and 
universally understood, would be preferable to a more complex yet harder to 
comprehend effort. The need for simplicity for the analysts and understanding by the 
decision makers is a cornerstone on which this model will be derived. 

It is not envisioned that this model will be a panacea to quality enlistment 
modeling. On the contrary, it will be promulgated as a ‘first effort’ on how to go 
about developing a model with the data base as given. 

A concerted effort will be put forth on the whys and hows of going through the 
ordinary least squares and stepwise regression analysis used in developing this model. 
Most Army analysts have little knowledge or experience in the detailed theory of 
regression analysis. Their familiarity with the subject matter may be limited to 
graduate level studies (if at all) or to some contact with regression models in previous 
duty assignments. The community of experts in the manpower modeling field is small 
and few are in the active Army. The chapters of this thesis will cover the details of the 
model, some of the theory of its development and application, and possible areas of 
further study that need to be pursued. It is desired that an examination of this 
material, some of which will be heuristic in nature, will bridge this gap in knowledge. 
Hopefully, it will lead to a better understanding of the dynamics that affect the quality 
of the force and the accepted methods of modeling the interrelationships involved. 
Army analysts must be able to do more than just ‘crunch the numbers’ that they are 
given by other analysts. Forming a base for the understanding and refinement of this 
model is another major objective of this thesis. 


B. BACKGROUND 

In February, 1986, the Chief of Staff, USAREC, tasked the Programs, Analysis 
and Evaluation Directorate (PAE) to review the current list of enlistment supply 
models and to reevaluate and assess what factors (variables) were contributing 
significantly to explaining quality male enlistment contracts. This thesis is in partial 
fulfillment of that requirement. Although there have been many studies in this field, 
the need still exists for continuing development in order that the Programs, Analysis 
and Evaluation Directorate may have an in-house model that uses current data and is 
accessible to Army analysts. Other studies, such as the Enlistment Supply Model 
published by Daula and Smith [Ref. 1] and the Recruiting Resources Allocation 
System by ABT Associates, Inc. [Ref. 2], are commendable. The problem is that they 
are neither readily accessible nor easily updated by USAREC or DCSPER personnel. 
Further, the level of understanding required is well beyond the expertise of the typical 
Army analyst. He must bear the burden of providing the day-to-day answers to 
various decision makers asking a plethora of questions on a litany of different issues. 
With his day-to-day plight in mind, the study objective of this thesis was conceived. 


C. STUDY OBJECTIVE 

The primary objective of this thesis is to develop a model using ordinary least 
squares multiple linear regression analysis and stepwise regression analysis to predict 
total Army male quality (GSM I-IIIA) contracts for future years. Special emphasis is 
placed on the explanation of the methods and techniques used to derive this model. 
All data elements must be readily obtainable and possess some potential for future 
prediction. 


D. THE DATA 

A longitudinal data base for this study was provided by PAE, USAREC (Table 
1). The data is cross sectional in that it is broken down by recruiting battalions 
(1A,1B,...,6L) and time series in that it provides data for each of these battalions by 
year (1982,1983,1984,1985). Knowing the structure of the data has important 
implications as to the types of techniques that will be employed in the regression 
analysis. Of the 56 recruiting battalions of USAREC, data elements for 55 were made 
available (battalion 3L, San Juan, Puerto Rico was omitted). In all, the data base 
contained 19 variables. For a more detailed explanation of the data, to include 
variable descriptions, see Appendix B. 


E. A REGRESSION REVIEW 

If one accepts the premise that historical actualities can be used as a basis to 
predict future events, then regression analysis is a powerful tool that can provide much 
insight into the phenomenon being predicted. The principle behind ordinary least squares is 
as follows. 
PARTIAL LIST OF DATA PROVIDED BY USAREC 
Using some of the data for Contracts (CONT) and Propensity (PROP) from 
Table 1 above, draw a straight line through a cluster of the plotted data points on a 
scatter diagram (Figure 1.1). Then, for each point, find the vertical distance from the 
straight line, square this distance, and then add together all of the squared distances. 
Of all the straight lines that could possibly be drawn through the points on the graph, 
the best-fitting line is the one with the smallest sum of the squared distances. This line is 
called the regression line. The signed (positive or negative) distance from any point to 
the regression line is called the residual. It is the difference between the actual value of 
Contracts (1A ACTUAL) and the value of Contracts that the regression line predicts 
(1A PREDICTED). 

Figure 1.1 Graphical Representation of Ordinary Least Squares Regression 

The residuals represent the error in the model. If there were no 
error in the model, and therefore no residuals, then the regression line would pass 
through point 1A ACTUAL and the residual would equal zero. In Figure 1.1, the 
residual = -384 for BN 1A. The sum of all of the residuals squared is called the sum of 
squares about the regression, or $\sum (Y_i - \hat{Y}_i)^2$. [Ref. 3] Without the theory of 
regression, if asked to predict next year's contracts (or any year's contracts), one would 
choose the mean or average number of contracts as the best predictor. The mean is 
represented in Figure 1.1 as $\bar{Y} = 1185$. The sum of the squared distances between the 
predicted values and the mean is called the sum of squares due to regression, or 
$\sum (\hat{Y}_i - \bar{Y})^2$. The mean is $\bar{Y} = \sum Y_i / n$, where n equals the number of data 
points. In this example, $\sum Y_i = 657 + 1585 + 1217$ and $n = 3$, so $\bar{Y} = 1185$. Another 
important term, called the total sum of squares corrected for the mean, is equal to the 
sum of squares due to the regression plus the sum of squares about the 
regression. Algebraically, this is $\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$. It will be 
helpful to keep Figure 1.1 in mind as this thesis is read. Although the figure portrays a 
simple linear regression of two variables (CONT being the dependent variable on the 
vertical axis and PROP being the independent variable on the horizontal axis), it has 
direct translation to the theory of multiple linear regression. In multiple linear 
regression, the objective is still to minimize the sum of the squared distances between the actual 
and the predicted values, only now there are several (instead of two) dimensions. 
Graphical interpretations cannot be made above three dimensions. Above three 
dimensions, the regression line becomes a regression hyperplane in the hyperspace 
defined by the independent variables. The important thing to remember, however, is 
that all of the mathematics required to derive the regression line for simple regression 
are still valid for multiple regression. Therefore, the analysis of multiple regression will 
rely heavily on the interpretation of these mathematically derived values (or 
estimators). The mathematically derived estimators for the regression line in Figure 1.1 
form what is called a regression equation. This regression equation is given in the form: 
$\mathrm{CONT} = \beta_0 + \beta_1 \, \mathrm{PROP} + \varepsilon$

where the variables are: 

Y = CONTRACTS 
CONT = the dependent variable 
PROP = the independent variable 

and the parameter estimators are: 

$\beta_0$ = 1700 = the intercept with the dependent variable axis 
$\beta_1$ = -44.8 = the slope of the regression line 

and the model error is represented by: 

$\varepsilon$ = residual with assumed distribution $N(0, \sigma^2)$ 

(residuals are also assumed to be independently distributed) 
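
The least squares calculation described in this section can be sketched in a few lines of Python. The fragment below is only an illustration: the (PROP, CONT) pairs are hypothetical values chosen for the example and are not taken from Table 1, and the function name fit_line is an assumption of this sketch. It fits a straight line, forms the residuals, and checks that the total sum of squares corrected for the mean equals the sum of squares due to regression plus the sum of squares about the regression.

```python
# Illustrative sketch only: the data below are hypothetical, not values from Table 1.

def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

prop = [10.0, 12.0, 15.0, 18.0, 20.0]            # hypothetical independent variable
cont = [1300.0, 1150.0, 1100.0, 900.0, 850.0]    # hypothetical dependent variable

b0, b1 = fit_line(prop, cont)
predicted = [b0 + b1 * x for x in prop]
residuals = [y - p for y, p in zip(cont, predicted)]   # actual minus predicted

y_bar = sum(cont) / len(cont)
ss_about = sum(e ** 2 for e in residuals)              # sum of squares about the regression
ss_due = sum((p - y_bar) ** 2 for p in predicted)      # sum of squares due to regression
ss_total = sum((y - y_bar) ** 2 for y in cont)         # total, corrected for the mean

print("intercept:", b0, "slope:", b1)
print("total:", ss_total, "due + about:", ss_due + ss_about)
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is why the two printed sums of squares agree.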


In looking at this particular equation, it seems counterintuitive that one would 
predict that, as the propensity for service goes up, the actual number of contracts goes 
down. This is because of the negative slope of the regression line which can be 
determined mathematically by the negatively signed parameter estimator for the slope. 
The signs of parameter estimates are important. The analyst must be cognizant of 
these anomalies and be prepared to think through the interpretation of his 
mathematical results. Hopefully, this thesis will explain this phenomenon. This study 
will outline many key estimators, how they are derived and their various uses. It is 
imperative, however, to understand Figure 1.1 before moving on into the body of this 
thesis. 


F. INITIAL ASSUMPTIONS 

There are several assumptions which should be explicitly stated. First of all, it is 
assumed that the data provided is accurate. This is imperative to the mechanics of 
model building and the analysis of the data. 

More important, however, is the assumption that the personal and 
environmental statistical data upon which the model is based have some effect on an 
individual's decision as to whether or not to enlist. Implicit in this assumption is that 
persons living in different areas of the country with different environments will behave 
differently. Also implicit is that different persons facing similar environments will 
behave in a similar manner. These assumptions, and the assumption that this behavior 
stays relatively stable across time, are fundamental to the cross sectional and time 
series regression analysis that will be required. 

Finally, since a linear regression model is being built, it is necessary to assume 
that trends will continue exactly as they have in the past. Over the near term, this is a 
reasonable assumption. Over the long term, it is not. This implies that the predictions 
from the model will be more accurate for the next one or two time periods than for 
more distant time periods. This is because real events rarely behave in a linear manner 
over long periods of time. 
G. THESIS OUTLINE 

This thesis develops and explains a model for the prediction of future GSM 
I-IIIA contracts. It is developed to predict the total contracts for a ‘typical’ Army 
recruiting battalion. In Chapter II, an outline is presented on how the regression 
model in this thesis will be built. Chapter III details some of the preliminary analysis 
and planning that led to the model formulation. Chapter IV continues through the 
development process and outlines many helpful statistical tools for data and regression 
analysis. Chapter V presents the model in detail and the results of the fitting of the 
model to the finalized data base. The last chapter, Chapter VI, lists the conclusions 
and recommendations of this study. Several Appendixes are included to enhance 
understanding and are referenced throughout the body of the thesis. A List of 
Appendixes is provided on page 6. Appendix A may be of particular interest. It is a 
select glossary of terms used in this study. If a certain term is unfamiliar, this is the 
first place one should look. 


H. PROGRAMMING LANGUAGES AND STATISTICAL PACKAGES. 

The programming languages used in the completion of this project are FORTRAN 
77 (the 1977 update of the Formula Translation language) and APL (A Programming 
Language). The statistical packages used were GRAFSTAT (IBM Corporation) and 
the SAS-Statistical Analysis System Version V (SAS Institute Incorporated). Because 
not all of these computational assets are readily available to most 
Army analysts, virtually all of the analysis and most of the required graphics that are 
presented can be accomplished using the SAS statistical package alone. This is in accordance 
with the current capabilities of both DCSPER and USAREC. Some GRAFSTAT 
graphics (such as Figure 1.1) will be presented only for the purpose of enhancing visual 
understanding. 


II. BUILDING REGRESSION MODELS 


Linear regression analysis is applicable to a vast array of subject matter. Linear 
regression models are built so that researchers can test the validity or falsity of 
hypothesized functional relationships. The purpose of the model that will be built in 
this thesis is to try to extract the main features of the relationships that are hidden or 
implied in the tabulated data in Table 1 on page 12. 

Before one starts building a model, it is useful to have an outline of how to go 
about the process. This chapter provides the basic structure that will be followed in 
Chapters III, IV and V. 

There are three distinct phases of building regression models. They are the 
Planning Phase, the Development Phase and the Verification and Maintenance Phase. 
[Ref. 3:p. 414] 

Building a regression model is a time consuming task. It is made even more time 
consuming by the requirement to fully explain and document assumptions, methods, 
and results. Documentation is essential because one must be very careful in the use of 
multivariable regression analysis. Results from predictive models can be easily 
misinterpreted or misused. The analyst is wise to state his assumptions and desired 
goals of the model in order to minimize the potential for misunderstanding. The 
figures of this chapter provide flowcharts that can be followed when faced with 
building a regression model. Although these flowcharts are generic in nature, they 
detail the special problems encountered when dealing with cross sectional and time 
series data. 

The regression review and Figure 1.1 in Chapter I discuss a simple regression 
approach. This thesis, however, will be describing some methods for building 
multivariate regression models. When analyzing multivariate models, the analyst must 
rely on many statistical indicators. Although these indicators will be mentioned in this 
chapter, a more detailed explanation will be provided in Chapters III, IV and V. 


A. THE PLANNING PHASE 

As can be seen in Figure 2.1, the first and foremost task in model building is to 
define the problem. Sometimes this is the most difficult step. What is the analyst 
really trying to accomplish? The problem statement must be specific, understandable 
and to the point. 


[Flowchart. Boxes: Define the problem; Select the dependent and independent variables; Is data available?; Check data for accuracy; Run the first regression; Is the data basic to the problem?; Check for collinearity; Establish goals; Are time and resources available?; Go to the Development Phase.] 

Figure 2.1 The Planning Phase of Model Building 


Next comes the data selection. Both the carrier (independent) and the response 
(dependent) variables must be clearly identifiable, readily available and as complete as 
possible. One should ‘brainstorm’ to try to think of any variable which might be 
relevant to the problem. 

One of the first tasks is to check the data for validity. Histograms and scatter 
plots are excellent tools for this. Look at the data distribution. Pay close attention to 
the outliers. Ask if there are valid explanations as to why some of the data looks as if 
it does not belong. If necessary, consult the experts for advice. Also pay particular 
attention to the range of the data. Data that varies little will sometimes provide 
artificially high or artificially low values for the degree to which the model fits the data. 
The regression hyperplane must fit through the hyperspace that is defined by the 
carrier variables. Small relative ranges tend to shrink this hyperspace, and obtaining 
good predictions will become difficult. 

Once the data has been verified, run the first regression. At first, it is only 
necessary to look at a few basic indicators. The analyst must be familiar with the 
information that the ANOVA table is providing. Stepwise regression is a powerful and 
widely accepted tool that can be extremely helpful when looking for significant 
variables that are basic to the problem. Stepwise regression is more fully explained in 
Appendix A. The analyst needs to become familiar with the ideas behind the 
correlation matrix and what it is indicating about multicollinearity. Multicollinearity 
arises whenever two or more independent variables used in the regression are not 
independent but are correlated. Among other things, the presence of multicollinearity 
will lead to larger standard errors for the parameter estimates in the model. Also it is helpful to understand the 
Variance Inflation Factor statistic and the Condition Index in the Variance Proportion 
Matrix. All of these indicators and procedures will be discussed in Chapter III. The 
first regression should provide a very rough indication of what kind of fits are going to 
be possible. 
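
The link between the correlation matrix and the Variance Inflation Factor can be illustrated with a small sketch. It assumes the special case of a model with exactly two carrier variables, where the VIF of each carrier is 1 / (1 - r^2) and r is the correlation coefficient between the two. The data values and the names carrier_a, carrier_b and carrier_c are hypothetical and do not correspond to any variable in Table 1.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_carriers(x1, x2):
    """VIF shared by both carriers when the model contains exactly two of them."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical carrier variables (not from Table 1).
carrier_a = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
carrier_b = [210.0, 240.0, 310.0, 330.0, 420.0, 440.0]   # moves closely with carrier_a
carrier_c = [7.9, 6.1, 8.4, 5.5, 7.2, 6.6]               # only weakly related to carrier_a

r_high = correlation(carrier_a, carrier_b)
r_low = correlation(carrier_a, carrier_c)
vif_high = vif_two_carriers(carrier_a, carrier_b)
vif_low = vif_two_carriers(carrier_a, carrier_c)

print("r(a, b) =", r_high, "VIF =", vif_high)
print("r(a, c) =", r_low, "VIF =", vif_low)
```

A VIF close to 1 indicates little inflation of the standard errors, while a large VIF signals that the two carriers carry much the same information. With more than two carriers, r^2 is replaced by the R-squared from regressing each carrier on all of the others.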


Finally, before leaving the Planning Stage, it needs to be determined whether 
there will be time and resources available to complete the task correctly. ‘Half efforts’ 
will lead to incorrect results and a lack of confidence in both the analyst and the 
regression procedures. The bottom line is that if time and resources are not available, 
then stop. Again, Chapter III provides a ‘walk through’ of the procedures that are 
detailed in this section. 


B. THE DEVELOPMENT PHASE 

This section provides a brief outline of the development phase of model building. 
Chapter IV will discuss in detail the concepts and statistical indicators that are outlined 
in this section. 

The first regression from the Planning Phase tells the analyst quite a bit about 
the behavior of the data in the model. Once the decision has been made to go ahead 
with the modelling effort, one moves to the Development Phase of model building. 
Many different approaches to the regression problem can occur during this phase. 

The analyst may feel uneasy about some facet of the initial regression findings. 
The Development Phase is time consuming in that trial and error is the normal method 
of testing various ideas. Many times, ideas evolve from the results of previous 
experiments. This is the hallmark of the scientific process. Figure 2.2 outlines the 
Development Phase of regression model building. 

Figure 2.2 The Development Phase of Model Building 

Sometimes new variables are derived from raw data. This is usually because the 
analyst has some idea that it makes sense to do so, or because the original regressions 
are not behaving in an intuitive manner. This is similar to what happened in Figure 
1.1 on page 13, where an increase in PROPENSITY resulted in a decrease in 
CONTRACTS. In the model that will be developed in this thesis, three out of the five 
variables that are finally utilized were derived from raw data. 


Once the analyst is satisfied with the data, it must be separated into 
cross-sectional groupings (all battalions) by time period (year). For the data in Table 1 
on page 12, this implies that it is separated into four distinct groups; all battalion data 
for 1982, all battalion data for 1983 and so on. The purpose of this procedure is to 
check for heteroscedasticity without having the mathematical results biased by 
autocorrelation. Heteroscedasticity is a condition where the variance of the error terms (ε) is not 
constant for all values of the independent variables. Autocorrelation is a condition 
where the error terms from different observations are correlated. Both of these 
conditions will affect the size of the standard error of the regression coefficient and 
therefore bias the results of the regression model. 
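
The separation of the pooled data into cross-sectional groups by year can be sketched as follows. This is a minimal illustration only: the records are hypothetical and hold just three fields, and the function name split_by_year is an assumption of this sketch rather than part of any package used in the thesis.

```python
from collections import defaultdict

# Hypothetical pooled (longitudinal) records: (battalion, year, contracts).
pooled = [
    ("1A", 1982, 700), ("1B", 1982, 820), ("1C", 1982, 910),
    ("1A", 1983, 720), ("1B", 1983, 790), ("1C", 1983, 950),
]

def split_by_year(records):
    """Group pooled records into one cross-sectional list per year."""
    groups = defaultdict(list)
    for battalion, year, contracts in records:
        groups[year].append((battalion, contracts))
    return dict(groups)

by_year = split_by_year(pooled)
for year in sorted(by_year):
    # Each group holds all battalions for a single year and would be
    # run through the regression procedures separately.
    print(year, by_year[year])
```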

Now each grouped (cross-sectional) data set is run through the regression 
procedures. The correlation matrix will indicate highly correlated carrier variables and 
the stepwise procedure will show which are the most significant in explaining the fit of 
the regression line. It is now time to drop those variables that are insignificant or are 
contributing the most to multicollinearity. Again, new variables may become apparent 
at any time. They should be included and scrutinized by the analyst until all practical 
possibilities have been exhausted. 

Rerun the regression for all of the finalized groups of data. Look at the results 
and compare between time periods. Are the parameter estimates comparatively stable? 
Are they signed the same? Are the same variables significant in each time period? Are 
they comparable in magnitude? If the groups are different, are they significantly 
different? Most of the answers to these questions are judgment calls on the part of the 
analyst. Whatever the call, he should be able to justify his decision based upon the 
knowledge of the problem and the underlying data base. Next, plot the residuals 
versus the predicted values and look for any signs of heteroscedasticity. If 
heteroscedasticity is present, the results of the regression cannot be considered valid. 
Unless the analyst has some valid reason to do otherwise, this should be the first time 
that he considers transforming the data. Transformations inherently lead to a lack of 
understanding in the modeling process and should be avoided up until the point at 
which the benefit to the model derived by the transformation exceeds the detriment to 
the user in the understanding of the model. If heteroscedasticity is significant, then 
apply the appropriate variance stabilizing transformation to the groups of data. 
[Ref. 3:p. 238] If heteroscedasticity is not a problem, or if the transformation renders 
the problem insignificant, it is time to re-pool the data back to its original longitudinal 
structure. The data set would look exactly like Table 1 again, except that now the analyst 
will be working only with those variables that were found to be significant in the 
cross-sectional analysis. 
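
The visual check of residuals against predicted values described above can be supplemented with a crude numerical comparison. The sketch below is not the procedure used in this thesis. It is a simplified check, similar in spirit to the Goldfeld-Quandt approach, that compares the residual variance in the upper half of the predicted values with that in the lower half. The data and the function names are hypothetical.

```python
def variance(values):
    """Sample variance of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def spread_ratio(predicted, residuals):
    """Residual variance for the upper half of predicted values divided by
    the residual variance for the lower half."""
    pairs = sorted(zip(predicted, residuals))   # order by predicted value
    half = len(pairs) // 2
    lower = [e for _, e in pairs[:half]]
    upper = [e for _, e in pairs[-half:]]
    return variance(upper) / variance(lower)

# Hypothetical predicted values and residuals whose spread grows with the prediction.
predicted = [500.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 1100.0, 1200.0]
residuals = [10.0, -12.0, 15.0, -13.0, 60.0, -70.0, 80.0, -70.0]

ratio = spread_ratio(predicted, residuals)
print("variance ratio (upper / lower):", ratio)
```

A ratio far from 1 suggests that the spread of the residuals changes with the predicted value, which is the pattern a residual plot would show as a fan shape. How far is too far remains a judgment call unless a formal test is applied.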

Run the regression on the entire pooled data set. Plot the residuals and check for 
autocorrelation. If autocorrelation is present, the results of the regression are biased 
and the standard error of the estimates is inaccurate. Test the hypothesis of no 
autocorrelation in the residuals using a runs test or the popular 
Durbin-Watson test. If autocorrelation seems to be a problem, then the true 
correlation coefficient of the data structure needs to be determined and another 
transformation on the data needs to be performed. Rerun the regression using the 
transformed data and then double check to ensure that the effects of autocorrelation 
are no longer present. The ‘best regression equation’ has now been determined. 
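
The Durbin-Watson statistic mentioned above is the sum of the squared differences between successive residuals divided by the sum of the squared residuals. The sketch below computes it for two hypothetical residual series. It assumes the residuals are already ordered in time, and the function name durbin_watson is chosen for this illustration only.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals that are ordered in time."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series.
trending = [3.0, 2.5, 2.0, 1.0, -0.5, -1.5, -2.5, -4.0]     # neighbors look alike
alternating = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0]  # neighbors flip sign

d_trending = durbin_watson(trending)
d_alternating = durbin_watson(alternating)
print("trending:", d_trending)
print("alternating:", d_alternating)
```

The statistic lies between 0 and 4. Values near 2 are consistent with no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. A formal decision requires comparing the statistic with tabulated critical bounds.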

Finally, check to see that the model is fulfilling the goals as set forth in the 
Planning Stage. If not, it may be time to start anew, possibly with new variables. Or, 
it may be time to reassess the goals of the model. Whatever the case, once the 
analyst has decided that the ‘best equation’ has been achieved, it is time to move to the 
model Validation and Maintenance Phase. Chapter IV details a step-by-step method 
for the development of the GSM I-IIIA model that is being built in this thesis. 


C. VALIDATION AND MAINTENANCE PHASE 

If the analyst feels comfortable about the achievement of the goals and the 
stability of the model after the Development Stage, then he has gone a long way 
towards the validation of the model. Figure 2.3 provides a step-by-step summary of 
this phase of model building. Chapter V details this phase as it applies to the 
regression model that is being built in this thesis. The concepts that are outlined in 
this section are more fully explained in Chapter V. 

One last check needs to be performed to see if there is any systematic lack of fit 
in the model. Remember that the residuals contain all of the information on the lack 
of fit in the model and they should be checked for any possible pattern. 

Next, validate the model. Validation merely implies checking to see if the model 
makes sense. Check the model by trying a few values of the predictor variables and see if the 
response variable makes sense. For instance, try some data points near an extreme of 
the prediction space to see if the response is coherent with that extreme. There are 
many methods of validation and there is really no ‘best method’. [Ref. 3:p. 420] As it 
is with variable selection, it is up to the judgment of the analyst. 

Figure 2.3 The Validation and Maintenance Phase of Model Building 

Is this equation useful and are these parameters reasonable? This is the final 
validation test of the model. Does it pass the scrutiny of the experts? The final 
product should achieve the desired objectives as outlined in the initial problem 
statement. Obviously, the intermediate goals were either achieved or revised in order 
to get to this final stage. The only thing left to do is to establish the proper 
documentation for the model. This should include all assumptions and the ranges of the 
inputs for which the model is valid. 

Finally, the model needs to be maintained, updated and periodically re-evaluated 
for accuracy and validity. This can be especially difficult for complex models that are 
to be maintained by Army analysts in a high turnover environment. One of the goals 
of this model has been to attempt to keep this maintenance procedure as simple as 
possible. It is now time to move on to Chapters III, IV, and V to see how well this 
goal was accomplished. 
III. PLANNING THE GSM I-IIIA MODEL 


This chapter explains the specifics of planning the GSM I-IIIA model. Much 
reference will be made to Figure 2.1 of Chapter II, which provides an outline of the 
Planning Phase. It may be useful to review Figure 2.1 at this time. 


A. DEFINING THE PROBLEM 

The problem definition stems directly from the study objective. This thesis will 
detail a step-by-step procedure which can be used to build a predictive model for future 
year GSM I-IIIA contracts. The data for this model must be easily updated and 
readily available. The data should also have some potential for future prediction. This 
model will be developed to predict the results of a ‘typical’ Army recruiting battalion 
and is not designed for predicting any specific battalion results. Since one of the major 
objectives of the thesis is to provide a ‘walk through’ for the reader on the hows and 
whys of model building, the author has chosen the first person plural as the pronoun of 
choice. We will now attempt to build this model. 


B. SELECTION OF THE INDEPENDENT AND DEPENDENT VARIABLES. 
Data for this project was provided by the Programs Analysis and Evaluation 
(PAE) section of USAREC. It is as appears in Table 1 on page 12 and as described in 
Appendix B. Since this model is now in the Planning Phase we should be 
‘brainstorming’ in order to try to think of any variable which might be relevant to the 
problem. We are trying to predict total contracts, and the variable CONT from Table 
1 seems to be the logical and ideal choice for the dependent variable. Also, we figure 
that other variables, both endogenous and exogenous, may play some role in 
determining the number of contracts signed. Many variables, such as the Consumer 
Price Index (CPI), are contemplated. These variables, mostly of the exogenous variety, 
might be useful in capturing some of the social or demographic dynamics of the 
enlistment process. The problem is, however, that these statistics are not available at 
the cross-sectional (battalion) level and time specific (by year) period that would fit 
with the rest of the data structure. 
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C. CHECKING THE DATA 

The final list of variables from the Planning Stage are as presented in Table 1. 
The only exception is with the battalion term, BN. Being alpha-numeric in nature, it 
can not be plotted in the multivariate hyperspace in order to determine a least squares 
fit. The analyst can substitute a numerical counterpart if he desires to use the 
battalion as a carrier variable. Therefore, the battalions are numbered from 1 to 55 
instead of from 1A to 6L. This variable will be more thoroughly discussed as the 
model is developed. Table 1 1s complete in that there are no missing data entries for 
any battalion during any year. Appendix B provides a detailed explanation of the data 
that will be used in this thesis. After checking the data using histograms and scatter 
plots and carefully verifying the outliers, the Planning Stage finalized matrix of 


longitudinal data appears below. 





\[
Y = \begin{bmatrix} 657 \\ 805 \\ \vdots \\ 1396 \end{bmatrix}, \qquad
X = \begin{bmatrix}
1 & \cdots & 1348 \\
1 & \cdots & 1370 \\
\vdots & & \vdots \\
1 & \cdots & 1869
\end{bmatrix}, \qquad
b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{17} \end{bmatrix}
\]

where Y = 220x1 matrix (a column vector of the values of the dependent variable, CONT)
      X = 220x18 matrix (a column vector of 1's concatenated with the
          220x17 matrix of the independent variables BN, YEAR, RCTR, UNEMP, ..., DODMA)
      b = 18x1 matrix (a column vector of parameter estimates)


Notice that this is the initial matrix format required for the Normal Equations
for Multiple Linear Regression (see definition in Appendix A). The column vector of
1's in the X matrix is required so that the matrix multiplication with the b matrix
includes the intercept term, b_0.
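
As a minimal sketch of how this matrix format is used, the Normal Equations
(X'X)b = X'Y can be solved directly for a very small data set. The data values and
the helper names transpose, matmul and solve below are illustrative assumptions and
are not taken from Table 1; Python is used only for illustration, since the thesis
itself relies on SAS.

```python
# Solve the Normal Equations (X'X) b = X'Y for a small, hypothetical data set.

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Each row of X starts with a 1 so that b[0] plays the role of the intercept.
X = [[1, 1.0, 2.0],
     [1, 2.0, 1.0],
     [1, 3.0, 4.0],
     [1, 4.0, 3.0],
     [1, 5.0, 6.0]]
# Y is generated exactly as 10 + 2*x1 + 3*x2, so the fit should recover (10, 2, 3).
Y = [10 + 2 * row[1] + 3 * row[2] for row in X]

Xt = transpose(X)
XtX = matmul(Xt, X)
XtY = [row[0] for row in matmul(Xt, [[y] for y in Y])]
b = solve(XtX, XtY)
print([round(v, 6) for v in b])
```

Dropping the leading 1 from each row of X would force the fitted plane through the
origin, which is why the column of 1's is part of the required format.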


D. THE FIRST REGRESSION 
As stated in the introduction, SAS will be utilized as the statistical package for all
of the analysis in this thesis. Appendix C shows the basic format for the SAS input.
Not every procedure was required for every step of the model development process.
With few exceptions, Appendix C lists all of the steps that were used throughout
Chapter III and some of Chapter IV. At each step in the Planning and Development
Stages, this thesis will specify the procedure that is important to that particular step
and provide a table of the output and diagnostics from SAS that are pertinent to that
step.

Running the first regression with the data as in Table 1 (except that 1 replaces 1A,
2 replaces 1B, etc.), several output indicators are obtained.


E. DETERMINING IF THE DATA IS BASIC 

Table 2 is the printout of the ANOVA table. The MODEL statement in SAS
automatically provides this output. [Ref. 4] Reference is made to Figure 1.1 on page
13 for a graphical interpretation and to Appendix A for the algebraic interpretation of
the values in the ANOVA table.


TABLE 2
ANALYSIS OF VARIANCE TABLE FROM SAS


DEP VARIABLE: CONT
ANALYSIS OF VARIANCE

                      SUM OF          MEAN
SOURCE      DF       SQUARES        SQUARE     F VALUE    PROB>F
MODEL       17      23653906       1391406     382.419    0.0001
ERROR      202        734963      3638.429
C TOTAL    219      24388868

ROOT MSE    60.31939      R-SQUARE    0.9699
DEP MEAN    1007.241      ADJ R-SQ    0.9673
C.V.        5.988577


For illustrative purposes, the values in the ANOVA table in Table 2 are derived
below. A few important facts to remember are that the MS ERROR is the best
(unbiased) estimate of the variance of the residuals and, therefore, the ROOT MSE is
the best (biased) estimate of the standard deviation of the residuals.


MODEL df = number of independent variables = 17
ERROR df = number of data lines - MODEL df - 1 = 220 - 17 - 1 = 202
CORRECTED TOTAL df = MODEL df + ERROR df = 17 + 202 = 219
SS MODEL = sum of squares due to regression = 23653906
SS ERROR = sum of squares about the regression = 734963
SS CORRECTED TOTAL = SS MODEL + SS ERROR = 24388868
MS MODEL = SS MODEL / MODEL df = 23653906 / 17 = 1391406
MS ERROR = SS ERROR / ERROR df = 734963 / 202 = 3638.429 = s²
F VALUE = MS MODEL / MS ERROR = 1391406 / 3638.429 = 382.419
PROB>F = upper tail area of the F distribution with 17 and 202 degrees of freedom = 0.0001
ROOT MSE = square root of MS ERROR = 60.31939 = s
DEP MEAN = the average of the 220 values of CONT = 1007.241 = mean of Y
COEFFICIENT OF VARIATION = (ROOT MSE / DEP MEAN) x 100 = 5.988577 = C.V.
RSQUARE = SS MODEL / SS CORRECTED TOTAL = 0.9699 = R²
ADJ RSQ = 1 - (1 - RSQUARE) x ((n - 1) / (n - MODEL df - 1))
        = 1 - (SS ERROR / SS CORRECTED TOTAL) x (219 / (220 - 17 - 1))
        = 1 - (0.03014) x (1.0842) = 0.9673 = adjusted R²
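
The arithmetic behind the ANOVA table can be checked with a few lines of code. The
sketch below is only an illustration of the derivation; it takes the sums of squares,
the number of data lines and the dependent mean from Table 2 as given, and the
variable names are assumptions chosen for readability rather than SAS names.

```python
import math

# Inputs read from Table 2.
n = 220                 # data lines (55 battalions over 4 years)
model_df = 17           # number of independent variables
ss_model = 23653906
ss_error = 734963
dep_mean = 1007.241

error_df = n - model_df - 1
total_df = model_df + error_df
ss_total = ss_model + ss_error

ms_model = ss_model / model_df
ms_error = ss_error / error_df            # estimate of the residual variance
f_value = ms_model / ms_error
root_mse = math.sqrt(ms_error)            # estimate of the residual std. deviation
cv = 100 * root_mse / dep_mean
r_square = ss_model / ss_total
adj_r_square = 1 - (1 - r_square) * total_df / error_df

print(f"ERROR df = {error_df}, C TOTAL df = {total_df}")
print(f"F VALUE  = {f_value:.3f}")
print(f"ROOT MSE = {root_mse:.5f}, C.V. = {cv:.4f}")
print(f"R-SQUARE = {r_square:.4f}, ADJ R-SQ = {adj_r_square:.4f}")
```

Small differences in the last printed digit relative to Table 2 are to be expected,
since SAS works from unrounded sums of squares.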


At this point in the planning stage, we are merely trying to determine if we have
variables that are basic to the regression. To determine this, we look at the F VALUE
and PROB>F statistics. If we did not have a regression, then we would not have a
slope. As seen in equation 1.1 on page 13, the slope is equal to our β_i values (for
i ≠ 0). By doing an F test (with 17 and 202 degrees of freedom), we postulate a null
hypothesis that the β_i values all equal 0. A high F value tends to reject this null
hypothesis, indicating that the β_i values do not all equal 0. The PROB>F is the actual
level of significance, α (actual), at which we reject this null hypothesis. What we are
saying in this ANOVA table is that there is less than a .0001 probability of rejecting a
true null hypothesis (H0: β = 0). In other words, if there were no slope and all of the
β values equaled 0, there would be less than 1 chance in 10,000 of obtaining an F value
this large.

We will use α (critical) = .1 as the critical level of significance when checking
variable significance in this thesis. Since α (actual) = .0001 < α (critical) = .1, we
continue with this data base knowing that there are some variables that are basic to
the regression.
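
The overall F test described above can be stated compactly. The block below is a
restatement using the values from Table 2; the symbol H_a for the alternative
hypothesis is an assumption introduced here for illustration.

```latex
H_0:\ \beta_1 = \beta_2 = \cdots = \beta_{17} = 0
\qquad \text{versus} \qquad
H_a:\ \beta_i \neq 0 \ \text{for at least one } i \geq 1

F = \frac{\text{MS MODEL}}{\text{MS ERROR}}
  = \frac{1391406}{3638.429} = 382.419
\quad \text{with } (17,\ 202) \text{ degrees of freedom}

\text{Reject } H_0 \text{ when } \alpha(\text{actual}) = \text{PROB} > F
  \ <\ \alpha(\text{critical}) = 0.1
```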

To determine which variables are basic to this particular regression, one would
look at the matrix for parameter estimates in Table 3. It, like the ANOVA table, is
printed automatically when the MODEL statement is requested in SAS. Looking
down the column of PROB > |T|, we find nine variables that meet our criterion of
α (actual) < α (critical). They are BN, RCTR, TOTPOP, WHIPOP, BLKPOP,
HISPOP, QMA, ARMYMS and DODMA. This is an indication that these are the
significant variables that are explaining this particular regression when all of the
variables are included at the same time.


TABLE 3
PARAMETER ESTIMATES WITH VARIANCE INFLATION FACTORS

                    PARAMETER        STANDARD     T FOR H0:                VARIANCE
VARIABLE  DF         ESTIMATE           ERROR   PARAMETER=0   PROB > |T|  INFLATION
INTER      1        11842.283                         0.540       0.5899       0.00
YEAR       1        -7.058418
BN         1         0.854557        0.341596         2.502       0.0132       1.77
RCTR       1                                          3.329       0.0010      11.17
UNEMP      1        -0.949188        2.454547        -0.387       0.6994      1.549
PROP       1        -0.559418        1.674842        -0.334       0.7387       3.44
HSMMA      1     0.0005085386     0.001407693         0.361       0.7183       8.32
PAYCO      1        -0.398091        2.458493        -0.244       0.8079       4.09
TOTPOP     1     -0.000150535                                     0.0001     101.47
WHIPOP     1     0.0001638404    0.0000336013         4.876       0.0001      61.98
BLKPOP     1     0.0001727377   0.00004012063         4.305       0.0001
HISPOP     1    0.00007096273   0.00002216656         3.201       0.0016       5.68
INCOMPC    1      0.002052645     0.005432364         0.378                    3.41
QMA        1                                                      0.0743       8.67
BNADV      1         0.015437        0.015610                     0.3239       1.22
E1PAY      1         1.346967        0.967061                     0.1652
ARMYMS     1         3633.248                                     0.0001       1.86
DODMA      1                                                      0.0001       4.57


Finally, we look at the result of the stepwise regression in Table 4. This comes
from the PROC STEPWISE statement in Appendix C. SAS will print a complete
ANOVA table as each variable is entered. Table 4 is the summary of relevant
statistics from each of these ANOVA tables, which SAS also provides. The analyst has
chosen to use the Stepwise Procedure, as opposed to the Forward Stepwise Procedure
or the Backward Stepwise Elimination Procedure. A summary of these procedures can
be found in Appendix A. The Stepwise Procedure indicates that there are four variables
that are significant at the α (critical) = .1 level when only one variable is brought in at a
time. They are DODMA, ARMYMS, RCTR and QMA. All other variables fail to
meet the critical level of significance.

We conclude this section of the model planning with the knowledge that there
exist data that are basic to the problem. The key indicators in Tables 2, 3 and 4 have
provided the "green light" to go ahead.


F. CHECKING FOR MULTICOLLINEARITY 

The reason that we check for multicollinearity is that if there is a linear
combination among the independent variables in the X matrix (page 25), then our
estimators will be unstable with high standard errors and we will probably calculate an
artificially high R² value. The R² statistic is an indicator of how well the model fits
the data. An artificially high R² value is undesirable. A good example of
multicollinearity (also known as collinearity) would be if the data base contained the
measures of PERCENT MALES and PERCENT FEMALES per battalion. Clearly,
these variables are not independent and if both were included in the regression model,
the model would suffer from collinearity problems.


TABLE 4
SUMMARY OF STEPWISE OUTPUT FROM SAS

STEPWISE REGRESSION PROCEDURE FOR DEPENDENT VARIABLE CONT

      VARIABLE            NUMBER   PARTIAL    MODEL
STEP  ENTERED   REMOVED     IN      R**2      R**2       C(P)   F VALUE   PROB>F
   1  DODMA                  1     0.7516    0.7516   1449.37    659.45    .0001
   2  ARMYMS                 2               0.9602                        .0001
   3  RCTR                   3     0.0038    0.9640     29.24     22.84    .0001
   4  QMA                    4     0.0010    0.9650     24.45              .0133
   5  E1PAY                  5     0.0004    0.9654                2.45    .1188
   6  INCOMPC                6     0.0004    0.9658     23.34              .1321
   7  BNADV                  7               0.9660     24.18              .3010
   8  BN                     8               0.9661     25.30
   9  HISPOP                 9
  10  WHIPOP                10     0.0002    0.9664                1.34    .2480
  11  TOTPOP                11                                     0.88    .3485
  12  BLKPOP                12               0.9697      9.04     21.34    .0001
  13  YEAR                  13                          10.49      0.56    .4566
  14  HSMMA                 14     0.0001    0.9698     12.28      0.21    .6450
  15  UNEMP                 15               0.9699     14.16      0.12    .7246
  16  PROP                  16                                     0.10    .7464
  17  PAYCO                 17     0.0000    0.9698     18.00

One indicator of multicollinearity is the Variance Inflation Factor (VIF) statistic, 
which is printed in the parameter estimates matrix. A SAS request of VIF in the 
MODEL statement provides this data in the Parameter Estimate Matrix (see Table 3). 
What is important to know about the VIF is that big is bad. Numbers of around 10 
and over indicate multicollinearity. [Ref. 3:p. 416] Notice in Table 3 that there are 
several Variance Inflation Factors near or over 10. 
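
To make the "big is bad" rule concrete, the sketch below computes the VIF for the
simplest case of two predictors. In general the VIF of a variable is 1 / (1 - R²),
where R² comes from regressing that variable on all of the other independent
variables; with only two predictors this R² is the squared correlation between them.
All data values and function names below are hypothetical and are not taken from the
USAREC data.

```python
def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical battalion-level measures.
pct_male = [48.0, 49.5, 50.0, 51.0, 52.5, 53.0]
# A second measure that almost mirrors the first (about 100 - pct_male).
pct_female = [51.9, 50.6, 50.0, 49.1, 47.4, 47.1]
# A measure with little linear relationship to pct_male.
unrelated = [3.0, 9.0, 4.0, 8.0, 5.0, 7.0]

vif_collinear = vif_two_predictors(pct_male, pct_female)
vif_unrelated = vif_two_predictors(pct_male, unrelated)
print(f"nearly collinear pair: VIF = {vif_collinear:.1f}")
print(f"weakly related pair:   VIF = {vif_unrelated:.2f}")
```

If the two percentages summed to exactly 100 for every battalion, the correlation
would be exactly -1 and the VIF would be undefined (division by zero), which is the
limiting case of the PERCENT MALES and PERCENT FEMALES example.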

Table 5 shows a partial output that is derived from SAS using the COLLIN
option in the MODEL statement of SAS (Appendix C). Another key indicator is
the Condition Index. Its derivation is somewhat involved. [Ref. 4:p. 55] As with the
VIF, a big condition index is not a good sign. A condition index of 50 or more
implies that the model suffers from multicollinearity. In this instance, there is an
indication that at least five independent variables appear to be collinear.


TABLE 5
PARTIAL MATRIX OF COLLINEARITY DIAGNOSTICS FROM SAS

COLLINEARITY DIAGNOSTICS

                       CONDITION   VAR PROP   VAR PROP   VAR PROP
NUMBER    EIGENVALUE     INDEX     INTERCEP     YEAR        BN
   1                      1.000     0.0000     0.0000     0.0004
   2        0.734740      4.639     0.0000     0.0000     0.0005
   3        0.498840                0.0000     0.0000     0.0545
   4        0.322976                0.0000     0.0000     0.0070
   5        0.234600      8.209     0.0000     0.0000     0.0716
   6        0.164497      9.804     0.0000     0.0000     0.4488
   7        0.070009     14.357     0.0000     0.0000     0.001
   8        0.048103     18.130     0.0000     0.0000     0.0021
   9                     20.626     0.0000     0.0000     0.1961
  10        0.020718     27.625     0.0000     0.0000     0.0404
  11        0.017691     29.099     0.0000     0.0000     0.0250
  12        0.014183                0.0000     0.0000     0.0035
  13        0.007865     44.835     0.0000     0.0000     0.0304
  14        0.006012                0.0000     0.0000     0.0244
  15        0.004832                0.0000     0.0000     0.0268
  16     0.000486733                0.0000     0.0000     0.0617
  17      0.00007761    451.351     0.0001     0.0001     0.0001
  18       1.687E-08      30611                0.9985     0.0015
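
The relationship between the eigenvalues and the condition indices in Table 5 can be
illustrated with a short sketch. It assumes the usual definition, in which the
condition index of an eigenvalue is the square root of the ratio of the largest
eigenvalue to that eigenvalue. The (eigenvalue, condition index) pairs are read from
Table 5; the largest eigenvalue is back-computed from those pairs and is therefore an
assumption, as are the variable and function names.

```python
import math

# (eigenvalue, printed condition index) pairs read from Table 5.
rows = [(0.734740, 4.639), (0.234600, 8.209), (0.164497, 9.804),
        (0.048103, 18.130), (0.007865, 44.835), (0.00007761, 451.351)]

# Assumption: index = sqrt(largest eigenvalue / eigenvalue), so each row
# implies largest eigenvalue = eigenvalue * index**2.
implied_largest = [ev * ci ** 2 for ev, ci in rows]
largest = sum(implied_largest) / len(implied_largest)

def condition_index(eigenvalue, largest_eigenvalue):
    return math.sqrt(largest_eigenvalue / eigenvalue)

for ev, printed in rows:
    recomputed = condition_index(ev, largest)
    print(f"{ev:<12g} printed {printed:>8.3f}  recomputed {recomputed:>8.3f}")

# Eigenvalues among these rows whose condition index reaches the threshold of 50.
flagged = [ev for ev, _ in rows if condition_index(ev, largest) >= 50]
print("largest eigenvalue (back-computed):", round(largest, 2))
print("flagged eigenvalues:", flagged)
```

Because the index grows as an eigenvalue shrinks toward zero, the smallest
eigenvalues at the bottom of Table 5 are the ones that signal near-linear
dependencies among the carrier variables.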


Table 6 is a printout of the correlation of estimates matrix. It is obtained from
SAS by requesting CORRB in the MODEL statement. Its derivation is simply the
(X'X)^-1 matrix scaled to unit diagonals. If you want to know which independent
variables are most highly correlated with each other, this is the place to look.
Inspection shows that all of the population variables are highly correlated. This
agrees with the VIF for TOTPOP, WHIPOP and BLKPOP, which also indicated a problem
with these variables. The VIF also indicated a problem with RCTR and possibly YEAR,
HSMMA and QMA. Checking Table 6 for these variables indicates that RCTR is most
highly correlated with HSMMA (-0.4926); YEAR with PAYCO and E1PAY (0.5236
and -0.7072); HSMMA with RCTR (-0.4926); and QMA with PAYCO (0.4771). An
arbitrary level of |ρ| > 0.4 was established by the analyst as an indicator of significant
correlation. It is at this time that one needs to remember that the correlation
coefficient shows only the extent to which two variables are linearly associated. It does
not necessarily imply that there is any causal relationship between the two variables.
Trying to figure out an explanation for the correlation between QMA and PAYCO
could be difficult unless one were intimately familiar with the data gathering process
and the demographics of these two variables. Even then, there may be no logical
reason for the correlation. The only thing that one needs to know is that these two
variables are correlated and that this relationship is possibly contributing to an error
in the parameter estimates. This same line of thought carries over to the model as a
whole. When we postulate a Y = βX + ε model, we are merely implying that there
is a linear association between the carrier and the response variables, not necessarily a
causal relationship.


TABLE 6
CORRELATION OF PARAMETER ESTIMATES FROM SAS

INTER YEAR BN RCTR UNEMP E1PAY

To summarize our first regression to this point, we know that there are variables
that are basic to the model as proposed using the current dependent variable, CONT.
Furthermore, the regression indicates some collinearity problems which will need to be
scrutinized in the full development phase. With the rough indicators that have been
derived thus far, we now need to assess some preliminary goals for the model.


G. ESTABLISHING GOALS

When attempting to diagnose a problem using only statistical indicators, one
must establish a standard against which results will be compared. This chapter has
already discussed a few goals that are desired by our analysis. A complete statement of
goals by the investigator is desirable at this point so that analytical results can be
quickly and decisively interpreted.


1) NUMBER OF PREDICTOR VARIABLES = as few as possible.
2) SIGNIFICANCE OF FINAL VARIABLES < 0.1 (α critical).
3) ROOT MSE < 20% x DEP MEAN (C.V. < 20).
4) VIF < 8 for all variables.
5) CONDITION INDEX < 50 for all variables.
6) FINAL R² VALUE = as high as possible.
7) NO DISCERNIBLE PATTERN IN THE PLOTTED RESIDUALS.


Figure 3.1 Goals of the GSM I-IIIA Model


With these preliminary goals as stated, the project now passes to the
Development Phase.


IV. DEVELOPING THE GSM I-IIIA MODEL


In this chapter we will go into the specifics of developing the GSM I-IIIA model.
Much reference will be made in this chapter to Figure 2.2 of Chapter 2. It may be
useful to review Figure 2.2 at this time.


A. SEPARATING THE DATA 

The first regression has provided information on some of the interactions among
the variables. In dealing with longitudinal data, there need to be checks for both
heteroscedasticity and autocorrelation. Presently the data contain 19 carrier variables
(the 18 as shown in Table 1 plus the 1 to 55 numerical representations for BN) on 55
battalions over a four year time period. It is desired to analyze these data and check for
homogeneity without having the results biased by autocorrelation. The residuals
contain all of the information concerning the fit of the model. Therefore, they can
contain information on both heteroscedasticity and autocorrelation at the same time.
By separating the data into time groups (by year) and running separate regressions on
the individual sets of data, the effects of autocorrelation are removed from the
residuals being examined.

After separating the data base, we now have four separate response matrices and
four separate carrier matrices. For example, the matrices for 1982 are as shown below.


\[
Y = \begin{bmatrix} 657 \\ 1585 \\ \vdots \\ 1217 \end{bmatrix}, \qquad
X = \begin{bmatrix}
1 & 1 & 53.73 & \cdots & \cdots & \cdots & 1348 \\
1 & 2 & 155.00 & 8.60 & 13.4 & \cdots & 2509 \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
1 & 55 & 103.25 & 11.08 & 8.50 & \cdots & 2066
\end{bmatrix}, \qquad
b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{15} \end{bmatrix}
\]

where Y = 55x1 matrix (a column vector of the values of the dependent variable)
      X = 55x16 matrix (a column vector of 1's concatenated with the
          55x15 matrix of the independent variables)
      b = 16x1 matrix (a column vector of parameter estimates)


Notice that there are now only 15 carrier variables. First of all, only the
numerical BN can be utilized in the least squares regression, so the alpha-numeric
representation had to be dropped. Also, the variables for YEAR and E1PAY had to be
dropped because there is no change in their values within any year across any
battalion. Their inclusion would make the X'X matrix singular because the carrier
matrix would not have full rank.
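
The loss of rank can be seen in a small numerical sketch. Within a single year, the
YEAR column is a constant multiple of the column of 1's, so the determinant of X'X
is zero and the Normal Equations have no unique solution. The recruiter counts and
the function names below are hypothetical and are used only for illustration.

```python
def gram(columns):
    """Return X'X for a matrix supplied as a list of its columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

ones = [1, 1, 1, 1]
rctr = [53, 71, 64, 90]              # hypothetical recruiter counts

# Single-year data set: YEAR never changes, so it equals 82 times the 1's column.
year_single = [82, 82, 82, 82]
det_single = det3(gram([ones, year_single, rctr]))

# Pooled data set: YEAR varies, so the three columns are linearly independent.
year_pooled = [82, 83, 84, 85]
det_pooled = det3(gram([ones, year_pooled, rctr]))

print("det(X'X), single year:", det_single)
print("det(X'X), pooled data:", det_pooled)
```

The same argument applies to any variable that takes a single value for every
battalion within a year, which is why E1PAY is dropped along with YEAR.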

The restructuring of the data into year groups in order to obtain the carrier
matrices can be accomplished by SAS. As shown in Appendix C, the use of the PROC
SORT statement will sort the data. This model uses the year as the basic time unit, so
our option is to sort the data BY YEAR.


B. ANALYSIS OF THE CROSS SECTIONAL DATA

After running the time grouped cross-sectional data, an analysis is performed in
much the same way as was done for the first regression. First of all, it is desired to
find basic variables. A summary of the stepwise regressions by year is presented in
Table 7.


TABLE 7
BY YEAR STEPWISE SUMMARY OF FIRST REGRESSION DATA


STEPWISE REGRESSION PROCEDURE FOR DEPENDENT VARIABLE CONT

            1982              1983              1984              1985
STEP  ENTERED  PROB>F   ENTERED  PROB>F   ENTERED  PROB>F   ENTERED  PROB>F
   1  DODMA    .0001    DODMA    .0001    DODMA    .0001    DODMA    .0001
   2  ARMYMS   .0001    ARMYMS   .0001    ARMYMS   .0001    ARMYMS   .0001
3 ۶ 0179 RCIR «0057 SRC TR .0244 ٣٢ -019% 
4 ٠ ۰ в Е 2 10596 vil .1894 
5 BN 200929 7 27285 +۶۶۹۹۹ ۰۶ HIPOP 40 
6 PROP ‚1657  WHIPOP .2426  JWHIPOP  .1088. BLKPOP TERETE 
7] | TOIPOP .3880 РАҮСО „2816 — TOTPOP 0425. ٣٣٣ 
8 — BEKPOP. c T10S0 7۲ .2348 . HISPOP. — 0103 НОРОР >> ٣ 
9 | INCOMPC .1429  BLKPOP  .5578 ۵۸ .0982 . HSMMA . 0570 
10 — HISPOP - .1470 SRI SPOP ol) SP ۶٢ .2981 РАҮСО . 2025 
11 ك۵‎ >1 od 6 ہ۳٣‎ .3227 BNADV . 3699 
12 Ds . ІМСОМРС .5001 BNADV .3865 BN „Эзе 
13 AYCO .4769  BNADV ×6 O . 9453 INCOMPC ٥٦ 
14  BNADV . 2207 nee ‚8059 9) INCOME Cue: ٣۳ . 4273 
15 | UNEMP 9208 SMMA Заа Е . 9852  UNEMP ‚6861 


Table 8 contains the variables, their PROB > |T| statistics and their corresponding
Variance Inflation Factors. This information came directly from the matrix of
Parameter Estimates with Variance Inflation Factors similar to the one displayed in
Table 3 on page 28.


TABLE 8
BY YEAR SIGNIFICANCE AND VIF FOR FIRST REGRESSION DATA

                1982              1983              1984              1985
VARIABLE  PROB>|T|     VIF  PROB>|T|     VIF  PROB>|T|     VIF  PROB>|T|     VIF
INTERCEP    .0001    0.000    .0001    0.000    .0001    0.000    .0001    0.000
BN          .0410    2.165    .4659    2.090    .9852    2.590    .6549    2.129
RCTR        .1435   12.832    .0312   13.889    .0958   15.116    .0665   11.506
UNEMP       .9208    1.748    .5432    1.592    .5255    1.650    .6861    1.954
PROP        .2605    4.552    .4195    4.179    .3567    4.316    .5945    5.256
HSMMA       .0975    8.849    .8815   10.701    .1460   10.558    .1873    9.276
PAYCO       .5194    2.168    .6254    2.175    .9660    4.086    .5185    5.564
TOTPOP      .0087  151.139    .0551  106.092    .0046  120.374    .0062  125.056
WHIPOP      .0240   95.524    .0424   65.671    .0010   69.150    .0007   65.525
BLKPOP      .0089   22.228    .1524   18.541    .0002   18.271    .0015   16.490
HISPOP      .0756    6.619    .1875    6.316    .0039    6.390    .0820    5.906
INCOMPC     .2488    5.505    .5346             .9790    5.940    .4795    5.554
QMA         .6745    6.710    .7917    6.269    .0122   21.907    .0945   21.762
BNADV       .5362    2.941    .7566    4.064    .4405    2.400    .4212    5.241
ARMYMS      .0001    1.897    .0001    1.695    .0001    2.203    .0001    2.052
DODMA       .0001    6.011    .0001    5.685    .0001    7.014    .0001    6.522


It is time to stop and really think about what is happening in this model. For
the proposed model using the dependent variable CONT, there are two independent
variables that are significant in every year in both the F-Test (Stepwise) and t-Test
(complete model) statistical analysis. They are DODMA and ARMYMS. There is
now only one question that needs to be asked. Is this knowledge of any value to us?
The answer is, probably not. First of all, DODMA and ARMYMS are derived ex post
facto. Army recruiting battalion areas are unique to the Army. Recruiting areas are
not uniform DOD wide. Therefore, it would be difficult and time consuming to
attempt to gather data of the proper cross-sectional structure in order to try to predict
these variables. This would violate one of the overall objectives of this particular
model. Secondly, since the dependent variable, CONT, is utilized to derive these two
variables, we would expect that they would all help to explain each other. This is why,
in Table 4, over 96% of the model has been explained (model R² = .9602) in the
stepwise procedure after the introduction of these two variables. Similar results were
obtained in the individual year stepwise regressions, with the R² lowest in 1983 and
as high as R² = .981 in 1985 after the introduction of just these two variables.

The variable RCTR is significant in every stepwise procedure (Table 7) and every
t-Test (Table 8) except for 1982. It seems to be a good predictor. It is easily
obtainable and, to a certain extent, controllable. It has good potential for
predictability. One only needs to look at present and proposed recruiter manning
rosters. RCTR, however, does seem to have significant collinearity problems. It
exceeds our goal of VIF < 8 for every year in Table 8. Checking the Correlation of
Estimates Table (not shown here but similar to Table 6 of Chapter 3), RCTR is most
highly correlated with HSMMA in 1982 (-.4325), HSMMA in 1983 (-.5209), DODMA
and HSMMA in 1984 (-.5290 and -.4809 respectively) and INCOMPC, QMA and
DODMA in 1985 (-.4187, -.4043 and -.4214 respectively).

Another noteworthy factor is that WHIPOP and TOTPOP in Table 7 seem to be
more significant than any of the other population variables. Other studies have shown
that areas of greater multiethnic population tend to attract significantly more recruits
than other areas. [Ref. 6] This would lead us to believe that the higher range
concentrations of WHIPOP would possibly have a detrimental effect on contracts. We
cannot, however, surmise anything yet as to why these two variables might be
significant. Our model has problems with collinearity with both WHIPOP and
TOTPOP. Both have VIF values substantially greater than 8 in Table 8. Other
significant collinearity problems seem to be arising with HSMMA, QMA and BLKPOP.

Unemployment is not a significant indicator at all. In Table 7 for 1982 and 1985,
it is the least significant of all of the predictor variables. Although this is
counterintuitive, unemployment has been shown in previous studies to be both
significant and insignificant in explaining GSM I-IIIA accessions, depending upon the
year and the dependent variable that is being studied. [Ref. 7] It may be that we are not
using this statistic in the most appropriate manner and should be thinking about
alternate possibilities of unemployment indicators for inclusion into the model.

Also, PROP is not a significant predictor. In Table 3 on page 28, the parameter
estimate for the first regression (entire set of data) was equal to -0.559418. The
negative sign of the parameter estimate is counterintuitive (similar to the negative sign
that we obtained with just 3 data points in Figure 1.1). This may be telling us
something. Parameter estimates for PROP in each year group regression were positive
for 1982 and 1983, but negative for 1984 and 1985. The α (actual) values for the t
statistic (Table 8) ranged from .2605 for 1982 to .3943 for 1985. All of these values are
outside of our model goal of α (critical) = .1. One reason that comes to mind when
attempting to explain this may be that propensity is high in smaller markets and low in
larger markets. Thus, although propensity may be high, it will not necessarily explain
a high (in absolute terms) number of contracts.

There seems to be much work that needs to be done here. The results of the first
regression, along with the results of the first set of time grouped regressions, show many
problems, especially with collinearity. Correlation is good if it is between the carrier
and response variables. It is not good if it is just between the carriers.


C. THE SECOND REGRESSIONS

At this time we decide to drop both DODMA and ARMYMS and rerun the
regressions. This series of regressions will be referred to as the second regression. In
order to circumvent the obvious problem of multicollinearity between WHIPOP and
TOTPOP, yet still retain them in the predictor matrix, a new variable is adopted. This
new term, PERCWI (for percent white), is merely WHIPOP divided by TOTPOP.
In SAS, this is easily produced by the algebraic equation immediately following the
INPUT line (Appendix D). Also dropped is the QMA variable. QMA was displaying
some problems with collinearity. In looking at Appendix B, it is noticed that QMA is
usually derived as a straight percentage of TOTPOP and only updated once every other
year, whereas HSMMA is a number based on actual counts that are performed by
recruiters and verified at certain non-specific time intervals by the Area Recruiting
Zone (ARZ) verification teams. All else being equal, HSMMA is the preferred statistic
because of its perceived accuracy. Since QMA and HSMMA are closely related, and
since there is also a problem with collinearity in the HSMMA variable, it is anticipated
that dropping QMA might help to alleviate this collinearity problem with HSMMA as
well.

The results of the second regression are only slightly encouraging. Tables 9 and
10 present the summary of the second regression results for the overall and year
grouped data bases. The regressions modeled 13 independent variables versus CONT.
The far left column of Table 10 lists the independent variables used in these
regressions. These tables present the results as compared to the preliminary
goals of the model as outlined in Figure 3.1 of Chapter 3.

The R² values all fell substantially, but this was to be expected after dropping
the two derived variables, DODMA and ARMYMS. The t statistic indicates that
RCTR is significant in every year, as does the stepwise regression procedure. The new
variable, PERCWI, is significant in every year with the stepwise procedure.
Furthermore, none of the population parameters are showing any signs of collinearity
problems. UNEMP and PROP, two variables that have been historically good
indicators, are significant in some years, but not in others. The VIF and Condition
Index (C.I.) indicate multicollinearity, especially with RCTR and HSMMA. Until this
problem can be solved, many of the key indicators are suspect in their accuracy.

There are several issues that arise from the second regression. The first is the
question of why BN would be a significant variable. BN is merely an ordinal number


TABLE 9
SECOND REGRESSION RESULTS VERSUS GOALS


1982 1983 1984 1985 
R^- йз» ‚ 75 Med . 60 
VARIABLES RCTR RCTR RCTR RCTR 
WHOSE PROP PROP PROP PAYCO 
PROB>] T| HISPoOP HISPOP 6 BNADV 
WAS UNEMP 
о PERCWI 
BN 
BNADV 
VARIABLES 
С: [> 50 2 25 1 1 
( TOTAL ^4 
VARIABLES RCTR RCTR RCTR Е 
w/VIF » 8 HSSMA HSMMA 
C. V. «20 YES YES YES YES 


TABLE 10
SECOND REGRESSION STEPWISE RESULTS
FOR VARIABLES WITH PROB>F < 0.1


PARAMETER ESTIIMATES OF oleh EE te Ahi Vy Aristo 
FROM STEPWISE REGRESSION 


1982 1983 1984 1985 

BN Заа Б = 2262 

RCTR 7 001 1ٍ1 3 10203 8 
UNEMP 2656 2505 - - 
EROR 17/256 2736 16.84 B 
HSMMA - - - - 
0 E - - = 

PERCWI 1292 208.4 18956 283.3 
BLKPOP - - - - 
SEOP -17Е-5 -27Е-5 -17Е-5 = 
INCOMES = = - = 

BNADV . 3034 - - . 1468 


given to the alpha-numeric battalion names. One must be very careful when using 
substitute ordinal level data in a regression equation. In this instance, however, it is 
signifying an interesting phenomenon. Why does the mere battalion name signify 
contracts? Part of the answer has to do with the concept of lurking or latent variables. 


As stated previously, there is no possible way in which one can collect numerical data 


on all possible aspects of the recruiting process. There are many undefinable or 
uncaptureable nuances that lead to the decision to enlist in the Army. Intangables 
such as leadership within the recruiting battalion, a wealth of overachieving recruiters, 
favorable local school officials or the mere history of being the ‘best’, “worst” or an 
“also ran” battalion may have significant impact. The fact that BN 1s showing up as a 
significant variable implies that battalions are doing the way they are just because they 
are named that battalion. In an attempt to capture this phenomenon and to discard 
the substitute numbering system for the battalions, the analyst checked several 
indicators of battalion output history over the four years covered by this study. 
Instead of merely using the (constant interval) BN number, another variable was 
contemplated that would more readily capture the 'spread' between the battalions. 
After several trials, the variable BNPER (meaning battalion percent) was adopted. It 
is the number of contracts signed by a battalion in a particular year, divided by the 
total number of contracts signed in that year. For example, BN IA signed 657 
contracts in 1982. There were a total of 51,431 contracts signed in 1982. Therefore, 
BN IA is given a new variable of 657/51431 = 0.0127744. In looking at all of the 
battalions over all of the years, the standard deviation of this indicator is less than one 
third of its mean and it is fairly normally distributed with no significant skewing. Some 
battalions are always near the top percent of total recruits, and some are always near 
the bottom.. This variable allows the analyst to control his inputs at the battalion level 
based on his knowledge of a particular unit. For instance, although a particular 
battalion usually recruits about 2.5 % of the total mission, a leadership change or a 
high recruiter turnover rate or a particularly disastrous local situation may force the 
analyst to decrease that number and re-distribute it to another more favorable location. 
Or, some demographic phenomenon may lead to an entire region (or Brigade) having 
their inputted numbers shifted. If this much detail is not desired, we can merely plug 
in the percent of total mission that has been assigned to that unit as a result of the 
latest Enlisted Personnel Model (EPM) run. 
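
The BNPER calculation described above can be sketched in a few lines of Python. This is only an illustration, not the SAS code used in the study; the function name `bnper` and the `ALL_OTHERS` bucket are assumptions introduced here, while the 657 and 51,431 figures are the ones quoted in the text.

```python
def bnper(contracts_by_battalion):
    """Return each battalion's share of the year's total contracts."""
    total = sum(contracts_by_battalion.values())
    if total <= 0:
        raise ValueError("total contracts must be positive")
    return {bn: n / total for bn, n in contracts_by_battalion.items()}


# Figures quoted in the text: BN 1A signed 657 of the 51,431 contracts in 1982.
# "ALL_OTHERS" is a made-up bucket holding the rest of the command.
shares_1982 = bnper({"1A": 657, "ALL_OTHERS": 51431 - 657})
print(round(shares_1982["1A"], 7))
```

Because every share is divided by the same yearly total, the shares of all battalions in a year always sum to one.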

There are some valid concerns with using proportions as predictor variables. 
First of all, their average value will never change (it will always be 1.00 / total number 
of battalions in this case). Secondly, this particular variable could not be used with the 
dependent variable, CONT, because they are linear functions of one another. It would 
be just like artificially plugging in equalities on both sides of the hypothesized linear 
regression equation. We are still, however, in the trial and error mode, so we may
be able to utilize this new variable in a future regression run.
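
The two concerns above can be written out explicitly. The notation is assumed here for illustration only: n is the number of battalions, CONT_{i,y} is the number of contracts for battalion i in year y, and T_y is the total for that year.

```latex
\[
T_y = \sum_{i=1}^{n} CONT_{i,y}, \qquad
BNPER_{i,y} = \frac{CONT_{i,y}}{T_y}
\]
\[
\frac{1}{n}\sum_{i=1}^{n} BNPER_{i,y} = \frac{1}{n}\cdot\frac{T_y}{T_y} = \frac{1}{n},
\qquad
CONT_{i,y} = T_y \cdot BNPER_{i,y}
\]
```

Within a single year T_y is a constant, so CONT is an exact multiple of BNPER, which is why the two cannot sensibly appear on opposite sides of the same yearly regression.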

The second issue is that PROP is now becoming a significant variable. As stated
earlier, it is speculated that propensity may be more of a proportion indicator than an
absolute value indicator. This might be due to higher propensities in smaller market
areas and vice versa. In Table 10, it is now seen that PROP has all positive parameter
values. The reason that PROP now has all positive parameter values, when in
the first regression it had both positive and negative values, has to do with the concept
of the costock. [Ref. 5] In speaking of the costock of an independent variable, we are
referring to all of the other independent variables in a particular regression. For
example, if we were modeling CONT versus RCTR, UNEMP and PROP, the costock
of PROP is RCTR and UNEMP. The thing to remember is that the value of a
parameter estimate of a particular independent variable may have more to do with the
data values of its costock than it does with its own data values. In other words, as given
in the example above, the derived parameter estimates for PROP may be more a
function of the data values of RCTR and UNEMP than the data values of PROP
itself.
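
The dependence of a coefficient on its costock can be seen in the simplest case of two predictors. The following is the standard textbook expression for a standardized least squares coefficient, offered as an illustration rather than as a formula taken from the study; r_{y1} and r_{y2} denote the simple correlations of the response with predictors 1 and 2, and r_{12} the correlation between the two predictors.

```latex
\[
b_1^{*} = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}}
\]
```

Only when r_{12} = 0 does the coefficient reduce to r_{y1}. Otherwise its size, and even its sign, depend on how the other predictor relates both to the response and to predictor 1.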

With this in mind, we look at another aspect of the second regression. In Tables
9 and 10 we notice that there are different significant variables in different years. As a
matter of fact, there are no two years in which the significant variables are the same.
We know that the costock has a lot to do with the values of a particular regression 
equation. All else being equal, we would certainly prefer that the regression equations 
for each year contain the same variables at the same level of significance. If this were 
to happen, we could compare parameter estimates with some degree of validity. One of
the largest abuses of regression analysis is the attempt to compare
parameter estimates that have been derived from two different regressions using two
different costocks. These types of comparisons are not valid.

Finally, the second regression is somewhat unstable across time periods in the R^2
values that are achieved (see Table 9). These R^2 values are not necessarily bad, but
since we are building a predictive model, a higher R^2 value is preferred. We are not
sure just how high an R^2 value can be obtained from this particular data base. If
two or more observations share the same values of the independent variables in the carrier
matrix but differ in their response values, the R^2 value can never attain unity. This is
because the regression hyperplane would be trying to pass through two different points
lying above the same location in the carrier space, which
cannot be done. This phenomenon is known as pure error. If pure error is present in
a data base, the R^2 value can never be 1.0. We do not know how much pure error is
present in this regression, but higher R^2 values are preferred.

D. THE THIRD AND SUBSEQUENT REGRESSIONS

A third regression is now planned. In order to check the PROP variable against
our suspicions that it is a proportion indicator, we contemplate changing the dependent
variable. Again, we must remember that the overall goal of the model is to predict
total GSM I-IIIA contracts. Perhaps a dependent variable of CONT/TOTPOP or
CONT/QMA would give us some indication of the proportion of a specific population
that a recruiting battalion is actually enlisting. One term that is utilized by the
recruiting community is that of Penetration. Penetration is the proportion of contracts
that are signed relative to the available GSM I-IIIA market. We adopt the term PENT,
which equals CONT/HSMMA. This looks to be an ideal response variable because we
have seen that there is definitely collinearity between HSMMA and the other predictor
variables (see Table 9). By putting HSMMA on the response side and dropping it
from the predictor side, we expect to decrease the problem with multicollinearity. Also,
we can now utilize the variable BNPER since there is no longer a strict linear function
between it and PENT. Since this is an entirely new approach with a new dependent
variable, we will keep all of the other carrier variables for this regression.

The results of this regression are much more encouraging. Tables 11 and 12 
present the summary of the third regression results for the year grouped data bases.
The far left column of Table 12 lists the independent variables used in these regressions
versus the dependent variable PENT.


TABLE 11
THIRD REGRESSION RESULTS VERSUS ESTABLISHED GOALS

                        1982          1983          1984          1985

R^2                     [illegible]   .85           .84           [illegible]

VARIABLES WHOSE         PROP          PROP          PROP          PROP
PROB > |T|              BNPER         BNPER         BNPER         BNPER
WAS < .1                RCTR          RCTR          RCTR          RCTR
                        BLKPOP        BLKPOP        PERCWI        PERCWI
                        INCOMPC (listed for two of the four years; year columns not recoverable)

VARIABLES w/
C.I. > 50 (TOTAL #)     2             2             1             1

VARIABLES w/VIF > 8     -             RCTR          -             -

C.V. < 20               YES           YES           YES           YES


TABLE 12
THIRD REGRESSION STEPWISE RESULTS
FOR VARIABLES WITH PROB > F < 0.1

PARAMETER ESTIMATES OF SIGNIFICANT VARIABLES
FROM STEPWISE REGRESSION

             1982          1983          1984          1985
PROP         11E-4         10E-4         90E-5         13E-4
BNPER        [illegible]   [illegible]   [illegible]   [illegible]
RCTR         -62E-5        -71E-5        -54E-5        -45E-5
PERCWI       -13E-3        -27E-3        -53E-3        -28E-3
UNEMP        -             -             -             -
PAYCO        -             -             -             -
BLKPOP       [illegible]   -15E-9        -             -
HISPOP       -             -             -             -
INCOMPC      -             -25E-7        -16E-7        -24E-7
BNADV        -             -             -             -


As compared with Table 9, the R^2 values have increased for most years and are
more stable. There is more stability in the variables across the years in that PROP,
BNPER and RCTR appear in every year using both the t-Test and the stepwise F-Test.
PERCWI also shows up every year in the stepwise procedure. There is only one VIF
greater than 8, and that is for RCTR in 1983. There are still collinearity problems in
every year according to the Condition Index numbers.
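
To give a feel for what a VIF cutoff of 8 means, the sketch below uses the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 obtained when predictor j is regressed on the other predictors. The function names are assumptions made for this illustration, and nothing here reproduces the SAS VIF output used in the study.

```python
def vif_from_r2(r2_j):
    """Variance inflation factor for a predictor whose regression on the
    remaining predictors has coefficient of determination r2_j."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)


def r2_at_vif(vif):
    """R^2 with the other predictors that corresponds to a given VIF."""
    if vif < 1.0:
        raise ValueError("a VIF is never below 1")
    return 1.0 - 1.0 / vif


# A VIF of 8 means 87.5 percent of the predictor's variation is
# already explained by the rest of the carrier variables.
print(r2_at_vif(8.0))
```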

Checking for collinearity in the Correlation of Parameter Estimates Matrix for
this regression (see Table 13), it is noted that there are several pairs of variables with
|ρ| > 0.4. Our collinearity problems are very probably arising from one of these
relationships. Since UNEMP, PAYCO, HISPOP, BNADV and EIPAY are not
significant in any year in Tables 11 and 12, these are the first candidate variables to be
dropped in the next regression attempt. Checking these variables against Table 13, it is
seen that UNEMP, PAYCO and BNADV are not highly correlated with any other
variable, HISPOP is negatively correlated with RCTR (-0.4331), and EIPAY is
correlated with PROP and INCOMPC (-0.4481 and -0.6207 respectively).

Dropping these five insignificant variables and running a fourth regression still
indicated a condition index greater than 50 for one variable. Since BLKPOP and
PERCWI are highly correlated (ρ > .7), these are the two suspect variables as to the
probable cause of this indicator of multicollinearity. In trying to determine which of
these variables to drop, it is decided that BLKPOP should go because it has been
shown to be the less significant of the two in more years.


TABLE 13
CORRELATION OF PARAMETER ESTIMATES FROM SAS
FOR THE THIRD REGRESSION

[Table body not recoverable from the source scan. Row and column labels legible in the CORRB matrix include INTERCEP, PROP, BNPER, RCTR, PERCWI, UNEMP, PAYCO, BLKPOP, HISPOP, INCOMPC and BNADV.]


Now a fifth regression was run. The independent variables were PROP, BNPER,
RCTR, PERCWI and INCOMPC. The dependent variable was PENT. For every
year except 1985, INCOMPC was the last variable to enter the stepwise regression. It
was also an insignificant variable in 1982 according to the t-Test. Every other variable
for every other year was significant for both tests. There was, however, still a
collinearity problem. A single condition index of greater than 50 was noted for every
separate year regression.

Several combinations using four of the five independent variables listed above 
were then tried. This is because one of our goals in this model is to use as few 
predictor variables as possible. It must be remembered that for every variable that is 
included in the model, the analyst must take the time and effort to predict that 
variable. It was hoped that a combination of four could be found that was 'as good as'
the above combination of five. Any combination chosen had to meet all of the goal
criteria as set forth in Figure 3.1. Finally, one ‘best equation’ was chosen. It was
decided that INCOMPC could be dropped with no substantial loss to the model. This
was determined when checking the partial R^2 values as given in the stepwise summary
(similar to Table 4). The partial R^2 values for INCOMPC ranged from 0.0001 in 1982
to 0.03 in 1985. These additions to the overall R^2 were considered insignificant.
The dropping of this variable also solved the condition index collinearity problem, with
the highest index value being 32.16 for 1982, which is well below our goal of 50.

Before moving on, a few issues need to be addressed. Although we have named
the regressions first, second, etc., this is really a misnomer. There have actually been
scores of regressions run to this point, each checking a different aspect of the problem
or verifying the intuitions of the analyst. One can do this to the point where the data
tends to dictate the “next move” of the analyst. If this happens, we will end up with a
model that will only fit the data that is contained in the data base. A predetermined
set of goals (such as Figure 3.1) tends to counter this problem. Also, the validation
phase contains provisions to check the model with different data to assure the model's
validity.

The most notable work with the other regressions was with the unemployment 
variable, UNEMP. For the time span of this study, UNEMP was not a significant 
variable except for a few regressions, mostly in 1982. This is counterintuitive to most
USAREC analysts. An attempt was made to transform this variable in two distinct 
ways. 

First of all, a variable called CHUNEMP was attempted. This variable was
actually the change in unemployment within a battalion between years. This was
derived by using the following formula.

$$CHUNEMP_t = (UNEMP_t - UNEMP_{t-1}) / UNEMP_{t-1}$$

where t = 1983, 1984, 1985


This variable did not prove to be any more significant than the UNEMP variable. 

Also, a dummy variable was defined as a battalion either being above or below
the average national unemployment as calculated by the Bureau of Labor Statistics. It
was hypothesized that although prospective accessions might not be familiar with their
particular unemployment rate, they could be cognizant of whether they were in an area
that was higher or lower than the national average as reported in the local media. This
dummy variable also did not prove to be significant.
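
The two unemployment transformations described above could be computed along the following lines. This is a minimal sketch, not the SAS data step used in the study; the function names and the sample rates in the comments are assumptions made for illustration.

```python
def chunemp(unemp_now, unemp_prev):
    """Relative change in a battalion's unemployment rate from the
    previous year to the current year."""
    if unemp_prev == 0:
        raise ValueError("previous-year unemployment rate must be nonzero")
    return (unemp_now - unemp_prev) / unemp_prev


def above_national(unemp_battalion, unemp_national):
    """Dummy variable: 1 if the battalion's rate exceeds the national
    average for the same year, otherwise 0."""
    return 1 if unemp_battalion > unemp_national else 0


# Hypothetical rates, in percent: a drop from 10.0 to 9.0 is a -10% change.
print(chunemp(9.0, 10.0))
print(above_national(8.5, 7.2))
```

Because the change variable needs a prior year, it is only defined for 1983 through 1985, which matches the range of t given with the formula.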

The only logical explanation for the insignificance of UNEMP is that its costock is carrying
the signal from UNEMP. It is thought that PROP is the predominant carrier of the
signal since PROP is the most significant variable in all of the regressions and is a
variable that is designed to capture several signals that may or may not be otherwise
measured.

The bottom line at this point is that although this study has discussed five
regressions to end up with four variables, the trials and thought processes that have
actually taken place significantly exceed those discussed in the text.


E. CHECKING FOR LEVERAGE

In regression model building, one should check every regression equation for
possible lack of fit due to outliers. Outliers may cause an effect called leverage which
can cause a significant decrease in R^2 values.

One method of finding outliers is to look at the “studentized” residuals. These
residuals are produced when a P or R is requested in the option section of the
MODEL statement in SAS (see Appendix D). Studentized residuals are merely the
actual residuals divided by their estimated standard errors, so that they have a variance
of approximately one.
Therefore, we would expect their values to range from about -3.0 to +3.0. With the
sample size of 55 battalions per year that we have, we would expect that approximately
two or three residual values per year would exceed 1.96 in absolute value. Looking down
the list of studentized
residuals in Table 14, we notice that there are two residuals that are outliers in 1982
([illegible] and 6J), eleven in 1983 (3B, 3D, 3E, 3F, 3G, 3H, 3J, 3K, 5A, 5B, 6G), none in 1984
and one in 1985 (6G). These battalions should be rechecked to ensure that their
underlying data base is accurate. If it seems to be proper, the analyst should attempt
to explain the deviation that these samples are displaying.
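
The screening rule above is easy to express in code. The sketch below is an illustration only: the function names are assumptions, and the residual values are invented for the example rather than taken from Table 14.

```python
def expected_exceedances(n_obs, tail_prob=0.05):
    """Number of residuals expected to fall outside +/-1.96 by chance alone
    when the studentized residuals are roughly standard normal."""
    return n_obs * tail_prob


def flag_large_residuals(studentized, cutoff=1.96):
    """Return the battalion codes whose studentized residual exceeds the
    cutoff in absolute value."""
    return sorted(bn for bn, r in studentized.items() if abs(r) > cutoff)


# Hypothetical studentized residuals keyed by battalion code.
example = {"1A": 0.4, "3B": -2.3, "5A": 1.2, "6G": 2.8}
flagged = flag_large_residuals(example)
print(expected_exceedances(55), flagged)
```

With 55 battalions the chance expectation is about 2.75 exceedances per year, so a year with many more flagged battalions than that deserves a closer look.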

Another more powerful indicator of lack of fit due to leverage is the Cook's D
statistic. [Ref. 4] It is also located in Table 14. It measures two things at once.
Cook's D will get large when (1) the residual gets large and (2) there is an outlying
data point that lies outside of the data cloud in the carrier hyperspace and is
exerting some leverage on the regression plane. In Table 14, we notice that the Cook's
D statistic is significantly larger in 1982 for 6E; in 1983 for 3B, 3D, 3K, 5A, 6E and
6G; in 1984 for 1N, 6E and 6G; and in 1985 for 6G and 6H.
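
The two effects that Cook's D combines are visible in its usual textbook form, given here for illustration and not quoted from the study. The symbols are assumptions of this note: r_i is the studentized residual of observation i, h_{ii} is the i-th diagonal element of the hat matrix H = X(X'X)^{-1}X', and p is the number of estimated parameters including the intercept.

```latex
\[
D_i = \frac{r_i^{2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}
\]
```

The first factor grows with the size of the residual, and the second grows as the observation moves away from the center of the carrier data cloud, so a point can earn a large D from either source or from both.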

Discarding data from the data base is a judgement call on the part of the analyst.
One should never discard data from the data base without significant reason. The
biggest perpetrators of lack of fit for this model seem to be battalions 6E, 6G and 6F.
It is the judgement of the analyst to discard 6E and to keep the rest. The reasoning for
this is that battalion 6E represents Honolulu, which is an extreme point in almost every
statistical variable that is included in the model. Also, its actual contribution to
contracts (approximately one-half of one percent) is negligible. In consulting with
experienced USAREC analysts, it was learned that Honolulu (along with San Juan, P.R.)
is seldom used in other regression models due to its peculiar demographics and unique
characteristics.

On the other hand, 6G and 6F represent the Phoenix and the Portland battalions.
Phoenix is undoubtedly an outlier due to its low PERCWI value and Portland due to
its low PROP value. In any event, their exclusion is not deemed appropriate due to
the fact that they contribute significantly more total contracts than does Honolulu. In
fact, their inclusion (with the associated range of carrier variables) may tend to add to the
robustness of the model.


F. THE FINAL REGRESSIONS 

After discarding the values for battalion 6E and rerunning the regression, an
across the board increase in R^2 values is obtained. Partial R^2 increases ranged from
.0034 in 1985 to .0232 in 1983.

Table 15 shows the pertinent regression statistics for the final regression of 1985.
Other years were nearly identical. In every year the stepwise procedure brought in the
variables in the same order (PROP, BNPER, RCTR, then PERCWI). Tables 16 and 17
display the results of the final regressions which determined our 'best separate
equations'. A detailed discussion of these results will be provided later in the text.

Notice that every variable is significant in each test in each year (each was
significant at the 0.0001 level). All parameters are of similar magnitude across years
and carry the same sign. The regressions are stable across time periods and indicate
fairly good R^2
values for cross-sectional data. Since they each contain the same costocks, their
parameter estimates are comparable. We are satisfied that these regressions have
achieved our preliminary goals as specified in Figure 3.1. It is now time to check the
underlying assumptions of multivariable regression analysis to ensure that these
equations are valid.

TABLE 15
FINAL RESULTS OF 1985 YEAR GROUP REGRESSION

[SAS output for YEAR = 1985; table body not recoverable from the source scan.]


TABLE 16
FINAL REGRESSION RESULTS VERSUS ESTABLISHED GOALS

                        1982          1983          1984          1985

R^2                     [illegible]   .83           [illegible]   .74

VARIABLES WHOSE         PROP          PROP          PROP          PROP
PROB > |T|              BNPER         BNPER         BNPER         BNPER
WAS < .1                RCTR          RCTR          RCTR          RCTR
                        PERCWI        PERCWI        PERCWI        PERCWI

VARIABLES w/
C.I. > 50 (TOTAL #)     0             0             0             0

VARIABLES w/VIF > 8     -             -             -             -

C.V. < 20               YES           YES           YES           YES


TABLE 17
FINAL REGRESSION STEPWISE RESULTS
FOR VARIABLES WITH PROB > F < 0.1

PARAMETER ESTIMATES OF SIGNIFICANT VARIABLES
FROM STEPWISE REGRESSION

             1982          1983          1984          1985
PROP         11E-4         16E-4         12E-5         15E-4
BNPER        [illegible]   [illegible]   [illegible]   [illegible]
RCTR         -56E-5        -69E-5        -55E-5        -51E-5
PERCWI       -58E-3        [illegible]   -66E-3        -51E-3


G. CHECKING FOR HOMOGENEITY IN THE RESIDUALS

Our 'best separate equations' to this point are of the form:

$$PENT_t = \beta_{0,t} + \beta_{1,t} PROP + \beta_{2,t} BNPER + \beta_{3,t} RCTR + \beta_{4,t} PERCWI + \varepsilon_t$$

where t = 1982, 1983, 1984, 1985

These equations were derived under the assumption that the residual errors are
independent, that they have a mean of zero, that they have a constant variance (known
as homogeneity) and that they conform to a normal distribution. Heteroscedasticity is
where the model fails to meet the assumption of constant variance. The easiest
method of checking these regressions for heteroscedasticity is by plotting the residuals. 
The most common residual plot is the plot of the residuals versus the predicted 
values. The reason for this is that, in a least squares fit, the covariance between the residuals and the
predicted values is equal to zero, so any visible pattern points to a violated assumption. The procedure PROC PLOT in Appendix D
indicates how to get these residual plots from SAS. Each individual year has to be 
generated and checked. Figure 4.1 is the graph of the residuals versus the predicted
values for the year 1985. This is actually a three dimensional graph in that the plotted
data points indicate which battalion is being plotted. The resolution of SAS is only
down to the first character of the battalion (a plot of 1 can indicate any battalion from 1A to
1N), but it can give a quick indication of which general region is contributing the most
to the error in the model. In this particular graph, most of the 1's are lying below the
zero reference line and most of the 5's are lying above. This quickly gives us an
indication that the First Brigade is below the regression plane for Penetration and
Fourth Brigade is lying above. Similar results were obtained for the other three years.
There is no discernible pattern in this year (nor was there in any other year) and we
can tentatively conclude that there is no heteroscedasticity within the year groupings.

[Figure 4.1: SAS PROC PLOT output for YEAR = 1985. Residuals (RESID1, vertical axis, about -0.015 to 0.015) plotted against predicted values (YHAT1, horizontal axis, about 0.03 to 0.09). The plotting symbol is the value of BNN, the first digit of the battalion designation. One observation hidden. Individual points not recoverable from the source scan.]

Figure 4.1 1985 Plot of Residuals vs Predicted Values

Plotting the residuals against the predicted values is not the only plot that can or
should be used. Plotting the residuals against the independent variables can give some
indication as to whether a transformation of the variables is needed. If the residuals
plot out in a megaphone type shape (close to each other on one side of the graph and
spread apart on the other) then there is a problem with constant variance. If this
pattern is apparent, then a transformation of the response variable or
a weighted least squares regression method may be required. [Ref. 3:p. 148] An archlike
pattern may indicate the need for extra terms (such as a quadratic). Figure 4.2 shows
such a plot for each independent variable for a different year. Appendix D specifies
how to produce these plots from SAS. Again, each plot in each year must be checked.
These plots indicated no discernible pattern and heteroscedasticity is not indicated.
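
A rough numerical companion to the visual 'megaphone' check is sketched below. It is not the procedure used in the study, which relied on the plots alone; it is a crude split-sample variance ratio, loosely in the spirit of the Goldfeld-Quandt idea, and the function name and sample numbers are assumptions made for illustration.

```python
from statistics import pvariance


def split_variance_ratio(predicted, residuals):
    """Ratio of residual variance in the upper half of the predicted values
    to residual variance in the lower half. Values far from 1 suggest that
    the spread of the residuals changes with the predicted value."""
    if len(predicted) != len(residuals) or len(predicted) < 4:
        raise ValueError("need two equal-length lists with at least four points")
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    low_var = pvariance(low)
    if low_var == 0:
        raise ValueError("lower-half residuals have zero variance")
    return pvariance(high) / low_var


# Hypothetical megaphone pattern: residual spread grows with the prediction.
yhat = [1, 2, 3, 4, 5, 6, 7, 8]
resid = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
ratio = split_variance_ratio(yhat, resid)
print(ratio)
```

A ratio near one is what constant variance would produce; the plots remain the primary evidence, since a single ratio cannot reveal an archlike pattern.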


H. CHECKING FOR NORMALITY IN THE RESIDUALS 

One of the most important checks that the model is correct is the checking
of the residuals for normality. Normality is an initial assumption in the derivation of the
regression equations and is crucial for the validity of using F-Tests as key statistical
indicators. Furthermore, if there is no discernible pattern in the residuals and if the
residuals can be shown to follow a normal distribution, then there is no graphical or
statistical indication that heteroscedasticity is present in the proposed models.

One of the quickest methods of checking for normality is to plot the residuals 
and visually determine if the pattern follows a normal bell-shaped distribution. SAS 
can accomplish this using the PROC CHART statement as presented in Appendix D. 
The output for this procedure for 1985 is as shown in Figure 4.3. This figure tends to
support the assumption of a normal distribution, as did the charts of the other years.

MIDPOINT
RESID1   RESIDUALS                                  FREQ   CUM.   PERCENT   CUM.
                                                           FREQ             PERCENT
-0.019 |                                              0      0      0.00      0.00
-0.017 |*****                                         1      1      1.85      1.85
-0.015 |*****                                         1      2      1.85      3.70
-0.013 |                                              0      2      0.00      3.70
-0.011 |***************                               3      5      5.56      9.26
-0.009 |*************************                     5     10      9.26     18.52
-0.007 |***************                               3     13      5.56     24.07
-0.005 |*****                                         1     14      1.85     25.93
-0.003 |********************                          4     18      7.41     33.33
-0.001 |******************************                6     24     11.11     44.44
 0.001 |***********************************           7     31     12.96     57.41
 0.003 |****************************************      8     39     14.81     72.22
 0.005 |********************                          4     43      7.41     79.63
 0.007 |********************                          4     47      7.41     87.04
 0.009 |********************                          4     51      7.41     94.44
 0.011 |*****                                         1     52      1.85     96.30
 0.013 |*****                                         1     53      1.85     98.15
 0.015 |                                              0     53      0.00     98.15
 0.017 |*****                                         1     54      1.85    100.00
 0.019 |                                              0     54      0.00    100.00
        ----+----+----+----+----+----+----+----+
            1    2    3    4    5    6    7    8
                      FREQUENCY

Figure 4.3 Graphical Inspection for Residual Normality - 1985


We can use a Chi-Squared goodness-of-fit test to further support the hypothesis
of a normal distribution. The null hypothesis is H0: The residuals are distributed
Normal(0, σ^2). The results of the Chi-Squared test for each year are as follows:

1982 - α (actual) = .262
1983 - α (actual) = .580
1984 - α (actual) = .527
1985 - α (actual) = .319

Since these values of α (actual) are greater than α (critical) = 0.1, we fail to
reject the null hypothesis that the residuals for each year group are normally
distributed.
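
The decision rule applied to these results can be stated compactly. The sketch below only encodes the comparison of each reported α (actual) with α (critical) = 0.1; it does not recompute the Chi-Squared statistics, and the names used are assumptions made for illustration.

```python
ALPHA_CRITICAL = 0.1

# Attained significance levels reported in the text for each year group.
alpha_actual = {1982: 0.262, 1983: 0.580, 1984: 0.527, 1985: 0.319}


def reject_normality(alpha_actual_value, alpha_critical=ALPHA_CRITICAL):
    """Reject H0 (normal residuals) only when the attained significance
    level falls below the critical level."""
    return alpha_actual_value < alpha_critical


decisions = {year: reject_normality(a) for year, a in alpha_actual.items()}
print(decisions)
```

Under this rule none of the four year groups leads to rejection, whereas an attained level below 0.1 would.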

To summarize the progress on the planning and developing of the GSM I-IIIA
model to this point, the following steps have been accomplished.

1) First regression run. Basic variables present.

2) Data separated into time groups to nullify effects of possible autocorrelation.

3) Subsequent regressions to reduce the effects of multicollinearity.

4) Subsequent regressions to determine significant variables per time group.

5) Subsequent regressions to determine final ‘best separate equation’ per time group.

6) Check for leverage from insignificant outliers per time group.

7) Plots of residuals. Visual check in each time group for heteroscedasticity.

8) Check for normality in each time group using charts and statistical tests.

It is now time to repool the data back into its original longitudinal structure.
The data set has the same basic structure as in Table 1 on page 12, except that now we
will be working with only the four independent variables that were found to be
significant in the cross-sectional analysis.

Another regression is performed using these four variables. An overall R^2 value
of 0.7171 is obtained. As expected, each of the variables in the individual year groups
is significant in the overall regression using both the t-Test and the stepwise F-Test.
Again, multicollinearity is not a problem as the Condition Index and Variance
Inflation Factors are well below the model goals. It is now time to check the residuals
of this overall regression for any signs of autocorrelation.


I. CHECKING FOR AUTOCORRELATION

Autocorrelation is a problem that sometimes arises with time series data.
Positive autocorrelation tends to cause the standard errors of the estimated
coefficients to be underestimated and could lead to an indication of significance
(i.e., slope ≠ 0) when actually the coefficients are not significant.

Once the data is restructured and the regression is accomplished, one of the first
indicators of autocorrelation is for the residuals (in the overall regression) to become
non-normal. In our particular model, we will now be checking a total of 216 residuals
(54 battalions x 4 years) for normality. This is quite a large sample size to be trying to
determine a goodness-of-fit for any known distribution. If the statistical indicators
come out to confirm a normal distribution, it would be a very good sign. If not, it
could be due to the sample size or it could be the fact that the residuals are carrying
certain biasing information concerning autocorrelation. There are several methods to
check for autocorrelation which will be covered in this section.

The results of a Chi-Square goodness-of-fit test for the re-pooled residuals
indicate an α (actual) equal to .055. This is less than α (critical) = 0.1, so we reject
the hypothesis that the residuals are distributed normally. This is the first bad sign.

One very quick way of checking for autocorrelation is to look at the residual
plots in Table 18. These plots are given by SAS when R is requested in
the option section of the MODEL statement. These are actually plots of the
studentized residuals (similar to those presented in Table 14) of the overall regression.
A residual that is within 0.5 standard deviations of the mean is left blank; between 0.5
and 1.0 standard deviations gets a single *; between 1.0 and 1.5 gets **; and so forth.
When checking for autocorrelation, we look for patterns in these residuals. A graphical
example of this is given in Table 18. The GOOD is a hypothetical example that is
presented for illustrative purposes. The BAD are selected segments of actual results
from our newly postulated (repooled) model. Notice that there is a distinctive pattern
of unbroken series of positive or negative residuals in the actual (BAD) results. What
we are looking for is something similar to the GOOD results where there is a seemingly
random shift between the positively and negatively plotted residuals.


TABLE 18
PLOTS OF STUDENTIZED RESIDUALS FROM SAS

[Table body not recoverable from the source scan. It shows three panels of studentized residual bar plots on a scale of -2 to +2, with four rows (one per year) for each battalion: one hypothetical panel labeled THE GOOD (battalions UU through ZZ) and two panels of actual results labeled THE BAD (battalions 3C through 3H and 6A through 6J).]


Another graphical method is provided by SAS and is shown in Figure 4.4. The
PROC PLOT procedure is again used. This time we will plot the residuals from one
year versus the residuals of the previous year. The idea is that if autocorrelation is not
present, then the only discernible pattern should be a cloud of residual plots centered
around the (0,0) coordinate. Otherwise, we can assume that the two plotted residuals
are pairwise correlated and therefore not independent. Figure 4.4 does not look very
promising. The fact that many negative residuals are being plotted against other
negative residuals, and many positive residuals are being plotted against other positive
residuals indicates that positive correlation is very probable (negative correlation would
have been centered on the complementary northwest to southeast axis).
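
The visual impression from a lag-one plot can be backed by a single number, the correlation between each battalion's residual in one year and its residual in the previous year. The sketch below is an illustration in Python rather than the SAS procedure used in the study; the function name and the residual values are assumptions made for the example.

```python
from math import sqrt


def lag_one_correlation(previous_year, current_year):
    """Pearson correlation between paired residuals from two consecutive
    years, one pair per battalion."""
    if len(previous_year) != len(current_year) or len(previous_year) < 2:
        raise ValueError("need two equal-length lists with at least two residuals")
    n = len(previous_year)
    mean_prev = sum(previous_year) / n
    mean_curr = sum(current_year) / n
    cov = sum((p - mean_prev) * (c - mean_curr)
              for p, c in zip(previous_year, current_year))
    var_prev = sum((p - mean_prev) ** 2 for p in previous_year)
    var_curr = sum((c - mean_curr) ** 2 for c in current_year)
    if var_prev == 0 or var_curr == 0:
        raise ValueError("residuals in each year must vary")
    return cov / sqrt(var_prev * var_curr)


# Hypothetical residuals for four battalions in two consecutive years.
resid_prev = [-2.0, -1.0, 1.0, 2.0]
resid_curr = [-1.0, -2.0, 2.0, 1.0]
r = lag_one_correlation(resid_prev, resid_curr)
print(r)
```

A value near zero corresponds to the shapeless cloud around (0,0); a clearly positive value corresponds to points lying along the southwest to northeast axis.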

One should seldom rely on graphical methods alone, however. Another test that 
is easy to perform is the runs test. It is a simple non-parametric test based on 
probability theory. Reference is made to Figure 4.5. 

Our data is structured over a four year time period. If we place the residuals for
each battalion in a row over this four year period and if these residuals are independent
and randomly distributed, we would expect them to fall in a distribution that is similar to
the distribution that is depicted at the bottom of Figure 4.5. In Figure 4.5, if we have
four columns of residuals (where each column equates to a year) and each residual can
be either positive (+) or negative (-), then probability theory indicates that there are
16 different ways (2^4 combinations) that these four columns of positive and negative
residuals can be arranged. By looking at the actual arrangement versus the theoretical
arrangement, we compare to see if there is independence or non-independence.
Independence is indicated if the distributions are statistically identical. Too few runs (a
run being defined as a string of positive or negative residuals) indicates a positive
autocorrelation between the year groups. This means that the residuals in one time
period will be high if the residuals in the previous time period were high and low if those
of the previous time period were low. Too many runs indicate that there is a negative
correlation and that one year's highs will tend to be followed by lows in the next year,
and vice versa.
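
The theoretical distribution used in the runs test can be reproduced by brute-force enumeration. The sketch below is an illustration only; the function names are assumptions, and the only inputs are the four time periods and the two possible signs described above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def count_runs(signs):
    """Number of runs (maximal strings of identical signs) in a sequence."""
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def runs_distribution(n_periods=4):
    """Probability of each run count when every sign pattern of length
    n_periods is equally likely."""
    counts = Counter(count_runs(p) for p in product("+-", repeat=n_periods))
    total = 2 ** n_periods
    return {runs: Fraction(c, total) for runs, c in sorted(counts.items())}


print(runs_distribution())
```

For four periods the enumeration gives 2, 6, 6 and 2 patterns with 1, 2, 3 and 4 runs respectively, that is, probabilities of 1/8, 3/8, 3/8 and 1/8 and cumulative probabilities of .125, .50, .875 and 1.00.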

In looking at Table 19, our overall analysis of the regression residuals indicates
too few runs. This signifies positive correlation. By inspection, the actual cumulative
probability distribution in Table 19 is not identical to the theoretical cumulative
distribution in Figure 4.5; therefore the residuals are not independent and
autocorrelation is possible. This supports our observations from the BAD plots in Table 18.

One final check could be the Durbin-Watson Test. It is the most popular of the
autocorrelation tests. The Durbin-Watson test postulates the null hypothesis
that there is no correlation in the residuals (H0: ρ = 0 between adjoining periods).
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SAS ; 
PLOT OF RESIDUALS FOR 1984 VS RESIDUALS FOR 1985 


* * 





RESIDUALS FOR 1985 


Figure 4.4 Plot of Lag-One Residuals for 1984 vs 1985 from SAS 


Sy 


POSSIBLE COMBINATIONS = 2^4 = 16 


          (**** SIGNS ****)       RUNS 

    1      +   +   +   +            1 
    2      +   +   +   -            2 
    3      +   +   -   -            2 
    4      +   +   -   +            3 
    5      +   -   +   +            3 
    6      +   -   -   +            3 
    7      +   -   -   -            2 
    8      +   -   +   -            4 
    9      -   -   -   -            1 
   10      -   -   -   +            2 
   11      -   -   +   +            2 
   12      -   -   +   -            3 
   13      -   +   -   -            3 
   14      -   +   +   -            3 
   15      -   +   +   +            2 
   16      -   +   -   +            4 

RUNS                 1       2       3       4 

FREQUENCY            2       6       6       2 

PROBABILITY        .125    .375    .375    .125 

CUMULATIVE         .125    .50     .875    1.00 
PROBABILITY 


Figure 4.5 Theoretical Distribution for Runs Test 


SAS has an option (DW) which will calculate a Durbin-Watson statistic. If the 
underlying data base were purely time-series in structure, this option would be 
ideal. The underlying data base for this regression, however, is longitudinal. 
Furthermore, the time span of the serial portion of the data is only four years. This is 
too few observations per series to perform a Durbin-Watson test with any degree 
of accuracy. 


J. TRANSFORMATION OF THE VARIABLES 

All of the graphical and statistical techniques that we have employed indicate 
autocorrelation. This implies that a transformation of the data is required. The idea 
behind the transformation that we will use is to subtract out the effects of the previous 
year's correlation from the present year's data, and to use the resultant transformed data 
for building the finalized regression model. First, a determination of the actual 
correlation is required. The calculation of the true (actual) correlation coefficient, ρ_a, 
for the data base for our overall regression is according to the following formula. 

$$
\rho_a = \frac{e_{1A,85}\,e_{1A,84} + e_{1A,84}\,e_{1A,83} + e_{1A,83}\,e_{1A,82} + e_{1B,85}\,e_{1B,84} + \cdots}
              {e_{1A,84}^2 + e_{1A,83}^2 + e_{1A,82}^2 + e_{1B,84}^2 + \cdots}
$$


TABLE 19 
RUNS TEST RESULTS FOR OVERALL REGRESSION 
[Columns: BN, R82, R83, R84, R85, SIGNS, RUNS; TOTAL = 54] 


Substituting the residuals from the regression (Table 19), this implies that the 
true correlation coefficient for this overall regression is 

$$
\rho_a = \frac{(.002043)(-.008464) + (-.008464)(.008645) + (.008645)(-.001214) + (-.001031)(-.002061) + \cdots}
              {(-.008464)^2 + (.008685)^2 + (-.001241)^2 + (-.001031)^2 + \cdots}
       = .175482
$$
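
As an illustration of the calculation above, the following Python sketch pools the 
lag-one products and the squared residuals over all battalions. It is a sketch under 
stated assumptions: the function name and the sample residuals are hypothetical, and each 
battalion's residuals are assumed to be listed in chronological order (1982 through 1985). 
The thesis value of .175482 was computed from the SAS residuals, not from this code. 

```python
def lag_one_correlation(residuals_by_battalion):
    """Pooled lag-one correlation of residuals.

    residuals_by_battalion maps a battalion label to its residuals in
    chronological order. The numerator sums e_t * e_(t-1) within each
    battalion; the denominator sums e_(t-1) squared over the same pairs.
    """
    numerator = 0.0
    denominator = 0.0
    for series in residuals_by_battalion.values():
        for prev, cur in zip(series, series[1:]):
            numerator += cur * prev
            denominator += prev * prev
    return numerator / denominator


# Hypothetical residuals for two battalions over four years.
example = {
    "1A": [0.004, 0.003, -0.001, -0.002],
    "1B": [-0.002, -0.001, 0.002, 0.003],
}
print(lag_one_correlation(example))
```

Products are never formed across two different battalions here, which matches the 
within-battalion pairs listed in the formula. 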


A positive value for ρ_a is consistent with all of the other indications of correlation. 
For the first data line (BN 1A, 1982), the transformation of the independent and 
dependent variables is according to the following formulas. [Ref. 8:p. 510] 

$$
x^*_{i,1} = (1 - \rho_a^2)^{1/2}\, x_{i,1}, \qquad
y^*_{1} = (1 - \rho_a^2)^{1/2}\, y_{1} \tag{4.1}
$$

where i = PROP, BNPER, RCTR, PERCWI 


For the last 215 data lines, the following equations are utilized. 

$$
x^*_{i,j} = x_{i,j} - \rho_a\, x_{i,j-1}, \qquad
y^*_{j} = y_{j} - \rho_a\, y_{j-1} \tag{4.2}
$$

where i = PROP, BNPER, RCTR, PERCWI 
      j = 2, 3, ..., 216 


Again, these transformations are to nullify the effect of previous year correlation on 
the next year's data. 

Although we assume (and there is in fact) independence (and therefore no 
correlation) between one battalion in 1985 and another battalion in 1982, the data 
structure dictates that a transformation between these two data lines is warranted. For 
instance, there is no correlation between battalion 1A in 1985 and battalion 1B in 1982. 
However, equations 4.2 dictate that 

$$
x^*_{i,1B,1982} = x_{i,1B,1982} - \rho_a\, x_{i,1A,1985}
$$

and 

$$
y^*_{1B,1982} = y_{1B,1982} - \rho_a\, y_{1A,1985}
$$

where i = PROP, BNPER, RCTR, PERCWI 
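
The two transformation rules can be sketched in a few lines of Python. This is an 
illustrative sketch only: the function name is an assumption, and the input is taken to be 
one variable's values stacked in data-line order (all years for battalion 1A, then all 
years for 1B, and so on), so that the first year of each later battalion is differenced 
against the last year of the battalion before it, as described above. 

```python
import math

RHO_A = 0.175482  # lag-one correlation reported in the text


def transform_series(values, rho=RHO_A):
    """Apply equation 4.1 to the first data line and 4.2 to the rest."""
    if not values:
        return []
    first = math.sqrt(1.0 - rho ** 2) * values[0]
    rest = [cur - rho * prev for prev, cur in zip(values, values[1:])]
    return [first] + rest


# Hypothetical PERCWI values for two consecutive data lines.
print(transform_series([0.99, 0.99]))
```

With a value of .99 on two consecutive lines, the second transformed value is 
.99 - (.175482 x .99), or about .816272. 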


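After transforming all of the variables in the data base of the final model, we 
arrive at the finalized Development Phase matrix of longitudinal data. It appears as 
below. The column of Y holds PENT; the columns of X hold the constant, PROP, BNPER, 
RCTR and PERCWI. 

$$
Y = \begin{bmatrix} 0.406 \\ 0.049 \\ \vdots \end{bmatrix}, \qquad
X = \begin{bmatrix}
1 & 14.471 & 0.012 & 52.915 & 0.944 \\
1 & 12.520 & 0.010 & 42.817 & 0.791 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 6.955 & 0.021 & 78.398 & 0.750
\end{bmatrix}, \qquad
\beta = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_4 \end{bmatrix}
$$

where Y = 216 x 1 matrix (a column vector of the dependent variables) 
      X = 216 x 5 matrix (a column vector of 1's concatenated with the 
          216 x 4 matrix of the independent variables) 
      β = 5 x 1 matrix (a column vector of parameter estimates) 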


K. INSPECTING THE RESULTS 


A regression on these matrices is now performed with the results as displayed in 
Table 20. The 'T' on the end of the variable names now indicates a transformed 
variable. 


TABLE 20 
RESULTS OF REGRESSION ON TRANSFORMED DATA 
[SAS output; DEP VARIABLE: PENTTRA] 


This indicates that our "best regression" equation is 

PENTT = 0.062179 + 0.001531 PROPT + 2.68563 BNPERT 
        - 0.000537 RCTRT - 0.056823 PERCWIT + ε 

After checking for heteroscedasticity and leverage, and then repooling the data with the 
final four independent variables, we obtained a pre-transformation R² value of .7171. 
The R² value of the transformed data is now .6549. This drop is to be expected after 
reducing the variables via the transformations due to the positive autocorrelation. The 
final model of the transformed data fulfills all of the preliminary goals as outlined in 
Figure 3.1. The positive parameter estimates for PROPT and BNPERT are reassuring. 
We would expect Penetration to increase as Propensity and Battalion 
percent of mission increase. The negative signs for RCTRT and PERCWIT, 
however, are worthy of discussion. 

If the coefficient of RCTRT is negative, then USAREC is probably experiencing negative 
returns to scale in the employment of recruiters. These results have been empirically 
substantiated by previous studies. [Ref. 6] This finding was not apparent in the initial 
regressions when CONT was the dependent variable. Obviously, more recruiters bring 
in more contracts. With PENETRATION as the dependent variable, however, the 
slope of the regression plane through the RCTR dimension in the carrier hyperspace is 
negative, indicating negative returns to scale in the market penetration. 

The negative slope for PERCWIT is a little more difficult to explain. It must be 
remembered that this variable was always the least significant of the four significant 
variables in the stepwise regressions (it was always brought in last). Again, it is very 
possible that its parameter estimate is being heavily influenced by the costock of 
variables. Furthermore, its absolute magnitude is relatively high. In checking with 
Appendix B, the maximum value of PERCWI is .99. A maximum PERCWIT input 
value of x* = .816272 would decrease PENTT by a total of .046383 (β x PERCWIT = 
-.056823 x .816272 = -.046383). The maximum PERCWIT input value would be 
derived by a battalion with a 99% white population that is transformed. This is 
calculated as 

x* = .99 - (.175482 x .99) = .816272 
where ρ_a = .175482 


A total decrease in Penetration of .046383 is significant when one considers that the 
average value of Penetration is .051157. This further supports the theory that the 
parameter estimate for PERCWIT is highly influenced by its costock. 

After satisfying ourselves that the 'best equation' has been obtained to this point, 
it is now time to move into the Validation and Maintenance Phase of the GSM I-IIIA 
model. 


V. VALIDATION AND MAINTENANCE OF THE GSM I-IIIA MODEL 


In this section we will discuss a few techniques for verifying and updating the 
GSM I-IIIA model. It may be useful to review Figure 2.3 of Chapter 2 at this time. 


A. CHECK FOR SYSTEMATIC LACK OF FIT 

Much work has been accomplished towards the development of this model. 
Many checks and balances have been performed along the way for compliance with the 
application of the theory of multivariable regression analysis. As was indicated in 
Table 20, we have achieved a final R² value of .6542 for the transformed data model. 
A few final checks need to be performed to ensure that there is no lingering systematic 
lack of fit. 

First of all, a plot of the residuals to check for normality is shown in Figure 5.1. 
A normal, symmetric distribution seems to be indicated. A Chi-Squared goodness of 
fit test is performed on these residuals. The hypothesis is H0: the residuals are 
normally distributed. The level of significance of this test is α (actual) = .4003. Since 
α (actual) > α (critical), and since the graphical representation indicates no apparent 
problems, we fail to reject the null hypothesis that the residuals are normally 
distributed. 

Secondly, we need to ensure that the transformation that was applied using 
equations 4.1 and 4.2 on page 60 is effective in nullifying the effects of autocorrelation. 

Longitudinal data presents special problems due to its structure. Autocorrelation 
is almost always a time series problem, and we have a mixture of cross sectional and 
time series data. The runs test is especially applicable to this type of data structure. A 
runs test was performed on the residuals from the transformed data and the results 
appear in Table 21. Comparing Table 21 with Table 19 indicates that there is 
much less of a problem now with too few runs. In fact, the middle of the distribution 
(two and three runs) has shifted dramatically toward the three runs side. A distribution 
like this indicates possible negative correlation. This would really be considered a weak 
indication, however, because the distribution in Table 21 is weighted 
more in the center than in the tails. A better indicator might be a check of the final 
calculation of ρ_a. 


Figure 5.1 Graphical Inspection for Residual Normality - Transformed Data 


Calculating ρ_a in the exact same manner as before, we derive a value of ρ_a = 
-0.0335. The negative sign confirms our suspicions of possible negative correlation, 
but, by inspection, the magnitude of ρ_a indicates that autocorrelation has been 
removed from the model. 

Since there is no suggestion of systematic lack of fit in the model, we can assume 
that the statistical tests that were utilized to derive the parameter estimates were valid. 
Now it is time to check these parameter estimates. 


B. MODEL RANGES AND VALIDATION 
There are several methods which can be employed to validate our model 
equation. As stated in Chapter 4, the equation is of the following form. 


PENTT = 0.062179 + 0.001531 PROPT + 2.68563 BNPERT 
        - 0.000537 RCTRT - 0.056823 PERCWIT + ε 


One of the quickest and easiest methods is to check the equation at the midpoint 
and at the extremes of the data ranges. By inserting the mean values of the 
independent variables on the right hand side of the above equation, we would expect 
the result to be equal to the mean value of Penetration. This is because, by 
definition, Ȳ = βX̄. Another check is to look at the minimum and maximum values 
of the dependent variable. First we choose the battalion with the lowest value of 
Penetration. Then we insert into the equation the data that corresponds to this 
minimum value. We would expect that the resultant value of PENTT from this 
equation would move away from the mean and towards the minimum value of 
Penetration. The same logic also applies for the maximum value of Penetration. 

Appendix B provides all of the relevant data that is required to initiate these 
tests. Appendix B also contains the data ranges for which this model is valid. 
Regression theory dictates that the regression equation is relatively reliable near the 
means of the inputted data ranges. At the extremes it is much less accurate. For any 
inputted data values outside of the data range, the model can be considered to have no 
predictive value. From Appendix B, the means of the data ranges are as follows: 


DATA AT THE MEAN => 
              PENT      PROP    BNPER    RCTR    PERCWI 
(MEAN)        0.05115   14.48   0.0183   88.69   0.8429 


Taking the minimum and maximum values of PENT from Appendix B, we search 
the data base to find the corresponding input variables for these values. The minimum 
Penetration over the four year time span was obtained by battalion 6J in 1982. The 
maximum Penetration was by battalion 3D in 1983. The variables for these two 
extreme values of Penetration are as follows: 


DATA AT THE MIN => 
              PENT      PROP    BNPER    RCTR    PERCWI 
(6J;1982)     0.01967    8.0    0.0136   59.0    0.9431 


DATA AT THE MAX => 
              PENT      PROP    BNPER    RCTR    PERCWI 
(3D;1983)     0.10396   23.9    0.0163   68.25   0.6676 


Applying the transform x* = x - (ρ_a) x for all of the above variables 
(where ρ_a = .175482), the following transformed variables are derived. 

              PENT      PROP    BNPER    RCTR    PERCWI 
(MEAN)        0.04218   11.93   0.0150   73.12   0.6950 
(6J;1982)     0.01622    6.59   0.0112   48.63   0.7776 
(3D;1983)     0.08572   19.71   0.0134   56.27   0.5504 


Inserting the values of the independent variables in the regression equation 
supplies the following results. 


TEST AT THE MEAN => 
0.062179 + 0.001531 (11.93) + 2.68563 (.0150) - 0.000537 (73.12) - 0.056823 (.6950) 
= 0.04221 


TEST AT THE MIN => 
0.062179 + 0.001531 (6.59) + 2.68563 (.0112) - 0.000537 (48.63) - 0.056823 (.7776) 
= 0.03204 


TEST AT THE MAX => 
0.062179 + 0.001531 (19.71) + 2.68563 (.0134) - 0.000537 (56.27) - 0.056823 (.5404) 
= 0.06741 


As can be readily seen, the test at the mean provides an estimate of the 
dependent variable (0.04221) that is extremely close to the mean value of the 
transformed dependent variable (0.04218). The discrepancy is due purely to roundoff 
error. At the extremes, the magnitude is not nearly so close. This lack of accuracy is 
not, however, unexpected. At the extremes, we are satisfied that the equations provide 
predictions that are in the correct direction. 
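
The spot checks above amount to evaluating the fitted equation on one line of transformed 
inputs. The following Python sketch shows that evaluation; the function and variable names 
are illustrative assumptions, the coefficients are those of the equation given in this 
section, and the sample inputs are the rounded transformed values for battalion 6J in 1982. 

```python
RHO_A = 0.175482  # lag-one correlation reported in Chapter 4

# Parameter estimates of the transformed-data model.
INTERCEPT = 0.062179
SLOPES = {
    "PROP": 0.001531,
    "BNPER": 2.68563,
    "RCTR": -0.000537,
    "PERCWI": -0.056823,
}


def transform(value, rho=RHO_A):
    """Apply x* = x - rho * x, the shortcut used for these single-line checks."""
    return value - rho * value


def predict_pentt(transformed_inputs):
    """Evaluate the regression equation for one line of transformed inputs."""
    return INTERCEPT + sum(
        SLOPES[name] * transformed_inputs[name] for name in SLOPES
    )


minimum_line = {"PROP": 6.59, "BNPER": 0.0112, "RCTR": 48.63, "PERCWI": 0.7776}
print(round(predict_pentt(minimum_line), 5))
```

Because the inputs are rounded to the precision printed in the tables, results from this 
sketch can differ from the printed test values in the fourth or fifth decimal place. 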


C. USING THE REGRESSION EQUATIONS 

Once we are satisfied that the regression equations are behaving correctly, 
we can begin to utilize the model as a tool for predicting GSM I-IIIA contracts. 

As was previously stated in this thesis, one of the primary objectives is to 
minimize the number of input variables in the model. For every independent variable 
that is included in the model, the analyst must devise some scheme to predict that input 
variable. It does not matter how close a fit one can achieve with a predicting 
regression model. The results can only be as accurate as the inputted data. 

Ways of predicting the independent variables for this particular model could be 
the subject for several more theses. The desired complexity is left totally to the 
discretion of the analyst. 

Some variables, such as RCTR and PERCWI are relatively stable and fairly 
predictable. Predicting experienced recruiters for a future year may merely entail 
looking at unit manning rosters. The use of the prior year estimate for PERCWI 
might be the most logical choice for the next year's prediction. 

The variable BNPER is relatively stable for some battalions, but suffers a wide 
variance in others. Again, unless the analyst has some reason to feel otherwise, 
possibly using the previous year’s data for next year’s prediction might be the most 
reasonable choice. 

Propensity is the most significant variable in the regression equation. We would 
like to be as accurate as possible in the prediction of this variable. The variance for 
this variable has been dissipated due to the fact that we are using a four year moving 
average. Propensity may be particularly amenable to more complex regression 
techniques since it is a 'catch all' type variable and may be partially explained by 
several other controllable variables. 

There are numerous methods that an analyst can utilize to predict future year 
carrier variables. For illustrative purposes, this study will make a few simple 
assumptions for a 1986 data base and demonstrate the proper method of applying the 
regression equation. If the analyst wishes to predict the penetration for any one 
particular battalion, he should follow the same methodology that was utilized in testing 
the minimum and maximum values. That is, merely estimate the values of the 
independent variables for the battalion under consideration, transform them, and insert 
these values into the regression equation. If the analyst wishes to predict contracts for 
the entire Army, he must estimate values for the entire data base. A simple example of 
this procedure is provided. 

The following assumptions will be utilized to determine the 1986 data base for 
the GSM I-IIIA model. These assumptions are merely hypothetical and are not based 
on any factual data or observations. 


1) PROP -    Assume a 2% across the board drop in propensity from 1985 
             levels for every battalion. 

2) BNPER -   Due to changing economic conditions, allocate an increase of 
             0.02 % to each battalion in the 5th Brigade (except 4A and 4C) 
             and a 0.02 % decrease in each battalion in the 6th Brigade from 
             1985 levels. 

3) RCTR -    Assume a net gain of two recruiters per battalion over 1985 
             recruiter endstrengths. 

4) PERCWI -  Assume the same white percentage population as in 1985. 


The data base for 1986, under the above assumptions, would be structured as shown 
below. A comparison with Table 1 on page 12 displays the differences between the 
1985 data base and this assumed 1986 data base. 


BN PROP86 BNPER86 RCTR86 PERCWI86 


1A 13.7 0.0129422 52.00 0.959761 
1B 15.4 0.0287982 152.50 07272306 
6L 6.5 0.0247549 99.00 0.909940 


After applying the necessary transformations as specified in equations 4.1 and 4.2 
on page 60, the finalized matrices for the assumed 1986 data base are as shown below. 

$$
Y^*_{86} = \begin{bmatrix} Y^*_{1A} \\ Y^*_{1B} \\ \vdots \\ Y^*_{6L} \end{bmatrix}, \qquad
X^*_{86} = \begin{bmatrix}
1 & 13.48 & 0.0127 & 51.193 & 0.944 \\
1 & 11.29 & 0.0107 & 42.874 & 0.791 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 6.101 & 0.018 & 88.841 & 0.677
\end{bmatrix}, \qquad
\beta = \begin{bmatrix} 0.062179 \\ 0.001531 \\ 2.68563 \\ -0.000537 \\ -0.056823 \end{bmatrix}
$$

where Y*_86 = 54 x 1 matrix (a column vector of the dependent variables) 
      X*_86 = 54 x 5 matrix (a column vector of 1's concatenated with the 
              54 x 4 matrix of the independent variables) 
      β     = 5 x 1 matrix (a column vector of parameter estimates) 


Multiplying the X matrix times the β matrix will result in a 54 x 1 matrix of the 
transformed y values (PENTT). This matrix represents the model's predictions for 
transformed penetration in each battalion in 1986. In order to solve for total 
contracts, we need to 'untransform' the y values and multiply the resultant matrix 
times the estimated number of HSSMA for each battalion. Since we transformed the 
data by Y*_86 = Y_86 - ρ_a (Y_85), we 'untransform' using the following equation. 

$$
Y_{86} = \rho_a (Y_{85}) + Y^*_{86} = \rho_a (Y_{85}) + X^*_{86}(\beta)
$$

This implies that 

$$
Y_{86} = .175482 \begin{bmatrix} .0543 \\ .0562 \\ \vdots \\ .0378 \end{bmatrix}
       + \begin{bmatrix} .03586 \\ .04010 \\ \vdots \\ \text{(illegible)} \end{bmatrix}
       = \begin{bmatrix} .04543 \\ .05001 \\ \vdots \\ .04395 \end{bmatrix}
$$


Using the USAREC estimates (as of 20 June, 1986) for the number of 1986 High 
School Male Market Available (HSSMA_86), the following matrix equations will 
provide the number of contracts per battalion for each of the 54 battalions represented 
in the model. Each battalion's penetration is multiplied by that battalion's HSSMA. 

$$
Y_{86} \times HSSMA_{86}
  = \begin{bmatrix} .04543 \\ .05001 \\ \vdots \\ .04395 \end{bmatrix}
    \times \begin{bmatrix} 12396 & 27547 & \cdots & 22784 \end{bmatrix}
  = \begin{bmatrix} 563 & 1377 & \cdots & 1001 \end{bmatrix}
$$

Taking the sum of all of the individual battalion contracts will result in the 
aggregate number of Army contracts predicted in 1986. 

Total Army contracts_86 = 563 + 1377 + ... + 1001 
                        = 50,132 
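
The 'untransform' and aggregation steps can be sketched as follows. This Python fragment 
is illustrative only: the function names are assumptions, and it uses just the first two 
battalions whose values are printed above. Because those printed inputs are rounded, the 
results can differ slightly from the thesis figures. 

```python
RHO_A = 0.175482  # lag-one correlation reported in Chapter 4


def untransform(y_star, y_prev, rho=RHO_A):
    """Recover penetration: y = rho * (previous year's y) + y*."""
    return [rho * prev + star for star, prev in zip(y_star, y_prev)]


def contracts_per_battalion(penetration, market):
    """Multiply each battalion's penetration by its own HSSMA estimate."""
    return [p * m for p, m in zip(penetration, market)]


# First two battalions only, using the rounded values printed in the text.
y85 = [0.0543, 0.0562]
y86_star = [0.03586, 0.04010]
hssma86 = [12396, 27547]

y86 = untransform(y86_star, y85)
per_battalion = contracts_per_battalion(y86, hssma86)
print(y86, per_battalion, sum(per_battalion))
```

Summing `per_battalion` over all 54 battalions would give the aggregate prediction. 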


Therefore, under the assumptions that we specified for the 1986 data base, total 
Army GSM I-IIIA contracts for the 54 included battalions in 1986 should equal 
50,132. This compares with 50,794 in 1982; 62,781 in 1983; 51,359 in 1984; and 55,098 
in 1985. 


VI. CONCLUSIONS AND RECOMMENDATIONS 


In this thesis, the problem of building a predictive model in order to determine 
high quality Army enlistment contracts was formulated and solved using stepwise and 
ordinary least squares linear regression analysis. 

The model was developed using a readily available data base and easily obtained 
variables. It is simple in structure and requires the analyst to predict only a limited 
number of input variables. All of these aspects contribute towards the desired goal of 
developing an easy-to-understand and easy-to-update regression model. 

This model could be used as a framework for the continued development and 
refinement of a predictive model to be used by USAREC and DCSPER analysts. 
There is a need for a ‘quick look’ predictive tool for getting fast answers to a variety of 
proposed policy changes. Army analysts at USAREC and DCSPER are trying to 
upgrade and refine their capabilities in this area. 

In concluding this study, a few recommendations are in order. First of all, there 
needs to be a concerted effort to continually maintain and update the relevant data 
bases under USAREC control. The mathematical formulations and theories that are 
used in the technical analysis are useless without an accurate data base. Furthermore, 
the data maintained by USAREC is highly susceptible to the effects of autocorrelation. 
In order to efficiently counteract this undesirable side effect, all of the data must be 
assimilated in time specific intervals. Monthly, quarterly or yearly data bases need to 
be established. Some conscientious and straightforward method needs to be developed 
in order to measure or estimate the variables. After this methodology is developed, it 
needs to be well-documented. A universal understanding of the data by both the 
on-line analysts and potential external/contractor analytical assistants is essential. 

Also, much work could be done towards predicting input variables for this model. 
Propensity is the most significant variable in this model and there are probably several 
variables in the data base which affect the propensity of individuals to join the Army. 
Discovering how income per capita or unemployment rates are reflected in the 
propensity for service could lead to some insight into the enlistment process. 

A more accurate assessment of the behavior of individual battalions could be a 
worthwhile project. This study models the 'typical' battalion and is useful in 
interpreting and comparing against the average. A more detailed study of each 
individual battalion could prove to be fruitful in leading to an understanding of the 
variances in the cross-sectional behavior over time. 

Finally, there needs to be a continued emphasis on the efficient allocation of 
recruiters. It is the one variable that is most easily controlled by the Army personnel 
establishment. The negative returns to scale that were discovered in the development 
phase of this model are somewhat unsettling. In a large and dispersed organization such 
as USAREC, some negative returns may be unavoidable. This is especially true when 
mission takes priority over costs. Their existence needs to be recognized, however, and 
positive control measures need to be implemented, continually assessed, and updated. 


APPENDIX A 


SELECTED GLOSSARY OF REGRESSION TERMS 


Definitions of selected regression terms are presented as follows: 


Adjusted R² (R²_A) - A statistic in which an adjustment has been made for the 
corresponding degrees of freedom of the two quantities, the Residual Sum of Squares 
(RSS) and the Corrected Total Sum of Squares (CTSS). The idea behind R²_A is 
that this statistic can be used to compare equations fit not only to a specific set of data 
but also to two or more entirely different sets of data. This statistic is usually used 
only as an initial gross indicator. [Ref. 3:p. 92] 


Analysis of Variance (ANOVA) Table - Format for the presentation of key statistics of 
a regression model. Typically, it is given as follows: [Ref. 3:p. 20] 


Source of        Degrees of      Sum of Squares                             Mean Square        F Value      Prob>F 
Variation        Freedom (df)    (SS)                                       (MS) 

Due to the       k               $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$   MS_Reg (= SS/df)   MS_Reg/s² 
Regression 
(MODEL) 

About the        n - k - 1       $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$       s² = SS/df 
Regression 
(ERROR) 

Total,           n - 1           $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ 
Corrected 
for the Mean 


where Y_i = Y (actual) 
      Ŷ_i = Y (predicted) 
      Ȳ   = Y (average) 
      n   = number of observations 
      k   = number of predictor variables 

Alpha (α) - α is the level of significance. It is the maximum probability of rejecting a 
true null hypothesis (H0). [Ref. 9:p. 78] 


Autocorrelation - Autocorrelation is a situation, usually found in time series data, in 
which the impact of an independent variable on the dependent variable is not always 
completely instantaneous. This implies that there is a correlation, usually over time. 
Also known as Serial Correlation. [Ref. 10:p. 289] 


Backward Stepwise Elimination Procedure - A procedure that tries to examine only the 
'best' regressions containing a certain number of variables. The basic procedure is as 
follows: 
1. A regression equation containing all of the variables is computed. 
2. The partial F-test value is calculated for every predictor variable 
   treated as though it were the last variable to enter the regression 
   equation. 
3. The lowest partial F-test value, say F1, is compared with a 
   preselected significance level, say F0. 
   a. If F1 < F0, remove the variable which gave rise to F1 from consideration 
      and recompute the regression equation in the remaining variables. 
      Then reenter stage (2). 
   b. If F1 > F0, adopt the regression equation as calculated. [Ref. 3:p. 305] 
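
The steps above can be sketched in plain Python. This is a minimal illustration and not 
the SAS stepwise routine used in the thesis: the function names, the small normal-equation 
solver, and the sample data are all assumptions made for the example, and the threshold F0 
is passed in directly as a number. 

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is nonsingular (no zero pivots).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def backward_elimination(data, y, f_threshold):
    """Drop the variable with the lowest partial F until it reaches F0."""
    kept = list(data)
    while kept:
        full = rss([data[k] for k in kept], y)
        s2 = full / (len(y) - len(kept) - 1)  # error mean square, full model
        partial_f = {}
        for name in kept:
            reduced = rss([data[k] for k in kept if k != name], y)
            partial_f[name] = (reduced - full) / s2
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_threshold:
            break
        kept.remove(weakest)
    return kept


# Hypothetical data: y depends on x1 only; x2 is unrelated.
sample = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
response = [2.1, 3.9, 5.9, 8.1, 10.1, 11.9, 13.9, 16.1]
print(backward_elimination(sample, response, f_threshold=4.0))
```

The partial F for a variable is computed here as the increase in the residual sum of 
squares when that variable alone is dropped, divided by the error mean square of the 
current model. 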


Carrier Variables - See Independent Variables. 
Coefficient of Determination - See R². [Ref. 10:p. 146] 


Confidence Coefficient - Confidence Coefficients are used when speaking of confidence 
intervals. The confidence coefficient is the number (1 - α) x 100 percent. Therefore, at 
an α equal to .05, the confidence coefficient is equal to 95 percent. [Ref. 10:p. 55] 


Corrected Sum of Squares - The Corrected Sum of Squares (CSS) is the value obtained 
when the Correction for the Mean is subtracted from the Uncorrected Sum of Squares. 
Notationally, this is CSS = ΣX_i² - (ΣX_i)²/n and is called the Corrected Sum of 
Squares for the X's. [Ref. 3:p. 14] 


Corrected Sum of Products - The Corrected Sum of Products (CSP) is the value 
obtained when the Correction for the Mean is subtracted from the Uncorrected Sum of 
Products. Notationally, this is CSP = ΣX_iY_i - (ΣX_i)(ΣY_i)/n and is called the 
Corrected Sum of Products for X and Y. [Ref. 3:p. 14] 


Correlation Coefficient - The correlation coefficient, ρ_UW, provides an empirical 
measure of the linear association between U and W. Its values can be between -1 and 
1. When ρ is nonzero, this means that there exists a linear association between the 
specific values of x_i and y_i in the data. The value of a correlation ρ_xy shows only the 
extent to which x and y are linearly associated. It does not by itself imply that any 
sort of causal relationship exists between x and y. [Ref. 3:p. 43] 
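
The three definitions above fit together in the usual sample correlation formula, the 
corrected sum of products divided by the square root of the product of the two corrected 
sums of squares. The following Python sketch shows this; the function names and sample 
values are illustrative assumptions. 

```python
import math


def corrected_sum_of_products(xs, ys):
    """CSP = sum(X_i * Y_i) - (sum X_i)(sum Y_i) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def corrected_sum_of_squares(xs):
    """CSS = sum(X_i ** 2) - (sum X_i) ** 2 / n."""
    return corrected_sum_of_products(xs, xs)


def correlation(xs, ys):
    """Sample correlation coefficient built from CSP and CSS."""
    return corrected_sum_of_products(xs, ys) / math.sqrt(
        corrected_sum_of_squares(xs) * corrected_sum_of_squares(ys)
    )


print(correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```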


C(P) Statistic - The C(P) statistic is used to assess the fit of a regression equation. It is 
closely related to the R² and adjusted R² statistics. A close fitting model will have a 
low C(P) value close to P, where P is the number of parameters in the model including 
β0. If several models are being contemplated, one method to determine the "best" 
model is to plot C(P) vs P for all of the models and then choose the model where C(P) 
falls closest to the P line. One word of caution, however, is that smaller models have 
smaller values of C(P), but larger models have C(P) values closer to P. If a low C(P) 
value close to P is not clear cut, then the analyst must make a decision. See the reference 
for more complete details. [Ref. 3:p. 299] 


Degrees of Freedom - Degrees of freedom (in regression) is a number that is associated 
with any sum of squares. This number indicates how many independent pieces of 
information involving the n independent numbers Y1, Y2, Y3, ... are needed to compile 
the sum of squares. [Ref. 3:p. 19] 


Dependent Variable - The receptor of changes that are deliberately made or that simply 
happen to the independent variables. Also called the Response Variable, it is the value 
that a regression model is trying to predict or control. [Ref. 3:p. 3] 


Dummy Variable - A variable used as an independent variable that is arbitrarily picked 
by the analyst. It is introduced to factor two or more distinct levels of data that may 
have separate deterministic effects on the dependent variables. They are usually (but 
not always) unrelated to any physical levels that might exist in the factors 
themselves. [Ref. 3:p. 241] 


Endogenous Variables - Variables that are jointly determined or that have outcome 
values determined through the joint interaction of other variables within the system. 
[Ref. 10:p. 339] 


Exogenous Variables - Exogenous variables affect the outcome of the endogenous 
variables, but are determined outside of the system. [Ref. 10:p. 339] 


F Test for the ANOVA Table - F equals the ratio of the Mean Square due to the 
Regression divided by the Mean Square about the Regression. Algebraically, it is F = 
MS_Reg / s² (see Analysis of Variance definition). This value is then compared to the 
100(1-α)% point of an F distribution with (N, - N,) and N, degrees of freedom. If 
the ratio is significant (i.e., prob>F in the ANOVA Table is greater than the selected 
100(1-α)%), then the model is probably inadequate and attempts should be made to 
discover when and how the inadequacy occurs. If the F value is insignificant (i.e., 
prob>F in the ANOVA Table is less than the selected 100(1-α)%), then it is reasonable to 
assume that the model is accurate and that the pure error (or residual error - s²) and 
the lack of fit (MS_L) mean squares can be used as estimates of σ². [Ref. 3:p. 37] 


Forward Stepwise Regression Procedure - A technique which begins with no variables in 
a model. For each independent variable, an F statistic is calculated to reflect that 
particular variable's contribution to the model if it is included. Variables are then 
included in the order of most significant to least significant. [Ref. 4:p. 102] 


General Linear Hypothesis - The General Linear Hypothesis is of the form Y = β0 
+ β1X1 + β2X2 + ε, where Y is the dependent variable, X1 and X2 are the 
independent variables, β0 is the intercept value, β1 and β2 are the 'coefficients' or 
parameter estimates and ε is the error term. [Ref. 3:p. 102] 


Heteroscedasticity - Heteroscedasticity is a situation in which the random errors (ε_i's) 
from the statistical regression model have different (non-constant) variances. 
[Ref. 10:p. 299] 


Homoscedastic - A situation where there is an identical variance in the random errors. 
Homoscedastic is the converse of heteroscedastic. [Ref. 10:p. 119] 


Idempotent Matrix - An idempotent matrix is a special form of a matrix that is 
symmetric and that holds the following two properties. [Ref. 10:p. 31] 

1) M = M' and 

2) M x M = M² = M 


Independent Variable - Variables that can either be set to a desired value or else take on 
values that can be observed but not controlled. Also known as Carrier or Predictor 
Variables. [Ref. 3:p. 3] 


Lack of Fit - A situation in which a postulated model is not correct. Lack of fit is 
present when the residuals contain both random AND systematic errors. [Ref. 3:p. 34] 


Level of Significance - See Alpha (α). 


Least Squares - A concept having to do with minimizing the square of the distance 
between an actual and predicted value. See Chapter 1, Section E for a detailed 
explanation. 


Latent Variables - Variables that are not incorporated in a regression equation (or, 
perhaps, are not even measured) that contribute to the error in the model. Also called 
Lurking variables. [Ref. 3:p. 295] 


Lurking Variables - See Latent Variables. 


Multicollinearity - Also known as ill conditioning, multicollinearity is a situation in
which there is an interrelationship amongst the predictor (or carrier) variables. These
interrelationships will adversely affect statistical results, which may cause estimated
values to be far from the true values. [Ref. 10:p. 610]


Multiple Regression - Regression using more than one explanatory (or carrier) variable.
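
One common gauge of multicollinearity in a multiple regression is the variance inflation factor (VIF). The sketch below covers only the two-predictor case, where the VIF reduces to $1/(1 - r^2)$ with $r$ the correlation between the two predictors. The function names and data are hypothetical.

```python
def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
nearly_collinear = [2.0, 4.0, 6.0, 8.0, 11.0]  # almost a multiple of x1
unrelated = [1.0, -2.0, 2.0, -2.0, 1.0]        # uncorrelated with x1

vif_high = vif_two_predictors(x1, nearly_collinear)
vif_low = vif_two_predictors(x1, unrelated)
print(vif_high, vif_low)
```

A VIF near 1 indicates little interrelationship; large values signal that the parameter estimates may be unstable.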


Nonsingular Matrix - A square matrix whose determinant is nonzero. Nonsingular
matrices have full rank (all rows and columns are linearly independent). [Ref. 11]


Normal Equation for Multiple Linear Regression - The general linear equation for
multiple linear regression in matrix form is as follows. [Ref. 3:p. 74]

$$X'X\hat{\beta} = X'Y$$


Overfitting - The fitting of regression equations that involve more predictor variables
than are necessary to obtain a satisfactory fit to the data. [Ref. 3:p. 298]


Outliers - An outlier is a point that is far from the mean in absolute value and is,
perhaps, several standard deviations away from the mean. In regression analysis, a
residual that is an outlier comes under close scrutiny in order to determine if its
peculiarity can be established. [Ref. 3:p. 152]

Parameter Equations for Simple Linear Regression - The general equations for
estimating simple linear regression parameters are as follows. [Ref. 3:p. 14]

$$\hat{\beta}_1 = S_{XY}/S_{XX} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$$

where: $S_{XY} = \sum X_i Y_i - n\bar{X}\bar{Y}$
       $S_{XX} = \sum X_i^2 - n\bar{X}^2$
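
The parameter equations can be evaluated directly. The following is a minimal sketch, with a hypothetical function name and data, that computes the slope as $S_{XY}/S_{XX}$ and the intercept as $\bar{Y}$ minus the slope times $\bar{X}$.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    s_xx = sum(a * a for a in x) - n * xbar * xbar
    slope = s_xy / s_xx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical points lying exactly on Y = 1 + 2X.
b0, b1 = simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(b0, b1)
```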


Residuals - Residuals (often denoted $e_i$) are the differences between the actual values of $Y$
and the predicted values of $Y$. Algebraically, this is denoted as $Y_i - \hat{Y}_i$. The residuals
contain all the information on the way in which the regression model fails to explain
the observed variation in the dependent variable. [Ref. 3:p. 34]


Residual Plots - Plots of residuals versus other parameters in the regression. For
analytical purposes, the plot of $e_i$ versus $\hat{Y}_i$ is common. The residuals are plotted
against the predicted values because the covariance between these two
values, $\mathrm{Cov}(e, \hat{Y})$, is equal to 0, whereas the covariance between the residuals and the
actual values is not (in general, $\mathrm{Cov}(e_i, Y_i)$ is a nonzero multiple of $\sigma^2$).
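
The point about plotting residuals against predicted rather than actual values can be checked numerically. The sketch below uses sample covariances as a stand-in for the theoretical ones, with hypothetical data and function names: for a least squares line with an intercept, the residuals have zero sample covariance with the fitted values but a positive sample covariance with the observed values.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a straight-line model."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    slope = s_xy / s_xx
    return ybar - slope * xbar, slope


def sample_cov(u, v):
    """Sample covariance with the usual n - 1 divisor."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v)) / (n - 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical observations
b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

cov_with_fitted = sample_cov(residuals, fitted)  # zero up to rounding
cov_with_actual = sample_cov(residuals, y)       # equals SSE / (n - 1)
print(cov_with_fitted, cov_with_actual)
```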


Ridge Regression - A regression procedure that is intended to overcome certain
ill-conditioned situations where correlations between the various carrier variables in
the model cause the X'X matrix to become close to singular, giving rise to unstable
parameter estimates. (The estimates may, for example, have the wrong sign or be much
larger than physical or practical considerations would deem appropriate). [Ref. 3:p. 313]


$R^2$ - $R^2$ measures the proportion of total variation about the mean $\bar{Y}$ explained by the
regression. Algebraically, $R^2$ = (SS due to the Regression)/(Total SS, corrected for
the mean $\bar{Y}$) = $\sum(\hat{Y}_i - \bar{Y})^2 / \sum(Y_i - \bar{Y})^2$. As more variables are added to the regression,
$R^2$ (unlike adjusted $R^2$) will never decrease. [Ref. 3:p. 19]
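
A short sketch of the calculation follows. The $R^2$ function uses the ratio of sums of squares given above. The adjusted $R^2$ function uses the standard textbook form $1 - (1 - R^2)(n - 1)/(n - k - 1)$ for $k$ predictors, which is assumed here because the glossary does not state it. Function names and data are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a straight-line model."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    slope = s_xy / s_xx
    return ybar - slope * xbar, slope


def r_squared(y, fitted):
    """(SS due to regression) / (total SS corrected for the mean)."""
    ybar = sum(y) / len(y)
    ss_regression = sum((f - ybar) ** 2 for f in fitted)
    ss_total = sum((v - ybar) ** 2 for v in y)
    return ss_regression / ss_total


def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of predictors k (assumed standard form)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]  # hypothetical observations
b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
r2 = r_squared(y, fitted)
r2_adj = adjusted_r_squared(r2, len(y), 1)
print(round(r2, 4), round(r2_adj, 4))
```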


Stepwise Regression Procedure - A technique which begins with no variables in the
model. For each independent variable, an F statistic is calculated to reflect that
particular variable's contribution to the model if it is included. Variables are then
included one by one in the order of most significant to least significant. Unlike the
Forward Stepwise Regression Procedure, however, once a variable is entered, a
regression is performed on all of the variables that are currently in the model, and any
variables that may now have an F statistic which is less significant than the newly
entered variable will be removed from the model. [Ref. 4:p. 102]

Weighted Least Squares - A regression technique used when some of the carrier
observations are 'less reliable' than others. This is usually indicated when the variances
of the observations are unequal or, sometimes, if the various observations are
correlated. The basic idea is to use a transform of the observations to other variables
that do fit the basic assumptions of the ordinary least squares model and then apply
the usual (unweighted) analysis to these new variables. [Ref. 3:p. 108]
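
The transformation idea can be sketched for the unequal-variance case. The example assumes each observation carries a known weight (for instance the reciprocal of its error variance, an assumption not stated in the glossary). Every term of the model, including the intercept column, is multiplied by the square root of the weight, and ordinary least squares is then applied to the transformed variables. Names and data are hypothetical.

```python
import math


def weighted_line(x, y, w):
    """Weighted least squares line via the square-root-of-weight transform."""
    root = [math.sqrt(wi) for wi in w]
    z0 = root                                   # transformed intercept column
    z1 = [r * xi for r, xi in zip(root, x)]     # transformed predictor
    t = [r * yi for r, yi in zip(root, y)]      # transformed response
    # Ordinary (unweighted) normal equations on the transformed variables.
    a = sum(v * v for v in z0)
    b = sum(u * v for u, v in zip(z0, z1))
    d = sum(v * v for v in z1)
    g0 = sum(u * v for u, v in zip(z0, t))
    g1 = sum(u * v for u, v in zip(z1, t))
    det = a * d - b * b
    intercept = (d * g0 - b * g1) / det
    slope = (a * g1 - b * g0) / det
    return intercept, slope


# Points exactly on Y = 1 + 2X: any positive weights recover the same line.
exact = weighted_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0],
                      [1.0, 4.0, 0.25, 2.0])
# Equal weights reproduce the ordinary least squares fit.
equal = weighted_line([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 3.0, 2.0, 5.0, 4.0],
                      [1.0] * 5)
print(exact, equal)
```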


X'X Matrix - Matrix notation format for determining $\sum X_i$, $\sum X_i^2$ and $n$.
It is of particular use in multiple regression for ease of computation.
The X'X matrix is determined as follows and is used in the Normal Equation for
Multiple Linear Regression (see definition). [Ref. 3:p. 74]

$$X'X = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}$$


X'Y Matrix - Matrix notation format for determining $\sum Y_i$ and $\sum X_i Y_i$.
It is of particular use in multiple regression for ease of computation. The X'Y matrix
is determined as follows and is used in the Normal Equation for Multiple Linear
Regression (see definition). [Ref. 3:p. 74]

$$X'Y = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}$$


X'X inverse Matrix $(X'X)^{-1}$ - The X'X inverse matrix is an extremely important concept
in multiple regression calculations. The calculation of this matrix allows for the solving
of the multiple regression equations. This matrix must be nonsingular. When both
sides of the Normal Equation for Multiple Linear Regression are multiplied by $(X'X)^{-1}$,
the resultant matrix is the matrix of the estimators of the coefficients, $\hat{\beta}$. The X'X
inverse matrix is calculated as follows. [Ref. 3:p. 78]

$$(X'X)^{-1} = \frac{1}{n\sum(X_i - \bar{X})^2}\begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix}$$
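
For the straight-line case, the three matrices can be assembled and multiplied by hand. The sketch below, with hypothetical function name and data, forms X'X and X'Y from the sums, applies the closed-form 2 x 2 inverse, and multiplies to obtain the coefficient estimates.

```python
def solve_normal_equations(x, y):
    """Estimate (intercept, slope) as (X'X)^-1 X'Y for one predictor."""
    n = len(x)
    sum_x = sum(x)
    sum_x2 = sum(a * a for a in x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    xbar = sum_x / n

    # X'X = [[n, sum_x], [sum_x, sum_x2]] and X'Y = [sum_y, sum_xy].
    # Closed-form inverse: 1 / (n * sum((x - xbar)^2)) * [[sum_x2, -sum_x], [-sum_x, n]].
    scale = n * sum((a - xbar) ** 2 for a in x)
    inverse = [[sum_x2 / scale, -sum_x / scale],
               [-sum_x / scale, n / scale]]

    intercept = inverse[0][0] * sum_y + inverse[0][1] * sum_xy
    slope = inverse[1][0] * sum_y + inverse[1][1] * sum_xy
    return intercept, slope


# Hypothetical points lying exactly on Y = 1 + 2X.
estimates = solve_normal_equations([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(estimates)
```

If all X values are equal, the scale factor is zero and X'X is singular, which is why the entry requires the matrix to be nonsingular.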

APPENDIX B
THE VARIABLES


A detailed listing of variable information, definitions and statistical
data follows. Statistical information does not include BN 6E (Honolulu).
Variables that appear in the final model are analyzed first, complete
with histograms. The other variables follow later with a less rigorous
summary. There was no attempt to weight any data elements. All
estimates are derived from performing statistical analysis on the raw
data as given.


The following variables appear in the finalized model. 


********** PENETRATION **********


VARIABLE NAME: PENT


DESCRIPTION: Contracts divided by HSMMA by battalion by year.
Penetration shows what percent of the market actually
contracted with the Army.


UPDATED: As Contracts and HSMMA are updated.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
.051157   .048616   .016318         .019678   .10397


********** PROPENSITY **********


VARIABLE NAME: PROP


DESCRIPTION: Army Positive Propensity measure. Four year moving
average of the percent of positive respondents to questions
about military and Army service on the Youth Attitude
Tracking Survey (YATS). The data is presented as percent
times 100.


UPDATED: Fall quarter (actual), other quarters (estimated)


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
14.48     14.1      4.5095          6.4       2


********** BATTALION PERCENT **********
VARIABLE NAME: BNPER
DESCRIPTION: Contracts divided by the total number of contracts signed
in any given year. BNPER tells what percent of the total number of
incoming GSM I-IIIA recruits were accessed by that particular
battalion.

UPDATED: Daily as contracts are updated.

MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
.0183     .0179     (illegible)     .0062     .0334
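
The two ratio variables defined above, PENT and BNPER, can be derived from the contract and market counts as in the sketch below. The battalion codes and figures are hypothetical and the function name is illustrative. Only the two definitions (contracts divided by HSMMA, and contracts divided by the yearly contract total) come from the variable descriptions.

```python
def add_derived(records):
    """Return copies of the records with PENT and BNPER added."""
    yearly_totals = {}
    for r in records:
        yearly_totals[r["YR"]] = yearly_totals.get(r["YR"], 0) + r["CONT"]
    derived = []
    for r in records:
        row = dict(r)
        row["PENT"] = r["CONT"] / r["HSMMA"]              # share of the market contracted
        row["BNPER"] = r["CONT"] / yearly_totals[r["YR"]]  # share of that year's contracts
        derived.append(row)
    return derived


# Hypothetical battalion-year records.
records = [
    {"BN": "A", "YR": 1984, "CONT": 900, "HSMMA": 20000},
    {"BN": "B", "YR": 1984, "CONT": 1200, "HSMMA": 30000},
    {"BN": "C", "YR": 1984, "CONT": 400, "HSMMA": 10000},
]
table = add_derived(records)
for row in table:
    print(row["BN"], round(row["PENT"], 4), round(row["BNPER"], 4))
```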


HISTOGRAMS OF MODEL RAW DATA

[Histogram panels not reproduced. Surviving labels: panel PROP; axis NUMBER OF DATA POINTS (218 TOTAL).]

Figure B.1 Histogram Distribution of Final Model Variables


********** RECRUITERS **********


VARIABLE NAME: RCTR
DESCRIPTION: Average number of on-production recruiters assigned.
On-production means all recruiters actively recruiting and assigned
contract quotas (missions).


UPDATED: Yearly, or as desired by checking unit manning rosters.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
88.69     83        22.539          5         165


********** PERCENT WHITE **********


VARIABLE NAME: PERCWI
DESCRIPTION: WHIPOP divided by TOTPOP. PERCWI tells what percentage
of the total population within a battalion area is white.


UPDATED: Every census (actual), each year (estimated) with 5-year
projections available every year.

MEAN      MEDIAN    ST. DEVIATION   MIN           MAX
.84296    .86       .098515         (illegible)   (illegible)

used in the derivation of other variables. These data points are
maintained at USAREC headquarters at Fort Sheridan, Illinois.

********** BATTALION **********


VARIABLE NAME: BN


DESCRIPTION: USAREC recruiting battalion reference codes
(BN 3L not provided)


UPDATED: As organizational realignments dictate


********** YEAR **********
VARIABLE NAME: YR
DESCRIPTION: Fiscal year (1982 to 1985)
UPDATED: 1 October of each year


********** CONTRACTS **********
VARIABLE NAME: CONT
DESCRIPTION: Number of GSM I-IIIA contracts actually written per year


UPDATED: Daily throughout the year


MEAN          MEDIAN    ST. DEVIATION   MIN       MAX
(illegible)   975       325.21          219       2021


********** UNEMPLOYMENT **********
VARIABLE NAME: UNEM


DESCRIPTION: Average total unemployment in a given battalion for a
given year. The data is presented as percent times 100.


UPDATED: Yearly by the Bureau of Labor Statistics with subsequent
(by zipcode) updates by USAREC to fit into battalion structure.


MEAN      MEDIAN        ST. DEVIATION   MIN           MAX
8.68      (illegible)   (illegible)     (illegible)   15.43


********** HIGH SCHOOL MALE MARKET AVAILABLE (CAT I-IIIA) **********
VARIABLE NAME: HSMMA
DESCRIPTION: Measured or predicted size of the available pool of high
school seniors or high school graduates within the last two years
that are in mental category I-IIIA. Also known as the market.

All variables were as given by USAREC except for HSMMA for 1985.
HSMMA for 1985 was the average value for HSMMA84 and HSMMA86 (as of
June 25, 1986)

UPDATED: Random times throughout the year by USAREC.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
21783     225       83953           8172      46120


********** PAY COMPATIBILITY **********


VARIABLE NAME: PAYCO
DESCRIPTION: Civilian to military pay compatibility. This is the
difference in the year-to-year percent change between
income per capita and the basic pay for an E-1 with under four
months of active duty service. Data is given in percent times 100.
UPDATED: As INCOMPC and E-1 PAY are updated.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
6.45      3.61      3.34            1         12.7


********** TOTAL POPULATION **********
VARIABLE NAME: TOTPOP
DESCRIPTION: Total population within a battalion area.


UPDATED: Every census (actual), each year (estimated) with 5-year
projections available every year.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
4.15E6    4.02E6    (illegible)     2.06E6    8.92E6


********** WHITE POPULATION **********
VARIABLE NAME: WHIPOP
DESCRIPTION: Total white population within a battalion area.


UPDATED: Every census (actual), each year (estimated) with 5-year
projections available every year.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
3.45E6    3.44E6    8.33E5          1.81E6    6.10E6


********** BLACK POPULATION **********
VARIABLE NAME: BLKPOP
DESCRIPTION: Total black population within a battalion area.


UPDATED: Every census (actual), each year (estimated) with 5-year
projections available every year.


MEAN      MEDIAN        ST. DEVIATION   MIN       MAX
4.90E5    (illegible)   (illegible)     6782      (illegible)


********** HISPANIC POPULATION **********
VARIABLE NAME: HISPOP
DESCRIPTION: Total Hispanic population within a battalion area.


UPDATED: Every census (actual), each year (estimated) with 5-year
projections available every year.


MEAN MEDIAN ST. DEVIATION MIN MAX
25 5 4.40E5 10496 2826


********** INCOME PER CAPITA **********


VARIABLE NAME: INCOMPC


DESCRIPTION: Average income per capita (in dollars) within a
battalion area.


UPDATED: Yearly by the Bureau of Labor Statistics with subsequent
(by zipcode) updates by USAREC to fit into battalion structure.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
9429      9394      1374            9255      15105


********** QUALIFIED MILITARY AVAILABLE **********
VARIABLE NAME: QMA

DESCRIPTION: Predicted number (times 100) of males within a battalion
area who are physically, mentally and morally qualified for service.
Normally predicted as a straight percentage of the total male
population.

UPDATED: Every two years.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
1013      1165      387             330       2658


********** BATTALION ADVERTISEMENT EXPENDITURES **********
VARIABLE NAME: BNADV

DESCRIPTION: Battalion level expenditures (in hundreds of dollars)
that were spent on advertising within the battalion. Does not include
any national advertising expenditures.

UPDATED: Yearly


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
2309      303       3529            273       2211


********** E-1 PAY **********
VARIABLE NAME: EIPAY


DESCRIPTION: Basic pay of an enlisted rank 1 (E-1) with under four
months of active federal service.


UPDATED: Yearly as Congressional pay changes mandate.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
359.5     573.6     9.615           551.4     573.6


********** ARMY MARKET SHARE **********
VARIABLE NAME: ARMYMS

DESCRIPTION: The total number of contracts by the Army divided by
the total number of Department of Defense contracts within a
battalion.

UPDATED: Yearly when DOD-A is updated.


MEAN          MEDIAN    ST. DEVIATION   MIN       MAX
(illegible)   .38       .04146          .26       .47


********** DEPARTMENT OF DEFENSE MINUS ARMY **********
VARIABLE NAME: DOD-A


DESCRIPTION: The total number of military contracts minus the total
number of Army contracts within the battalion.


UPDATED: Yearly by the Department of Defense with subsequent
(by zipcode) updates by USAREC to fit into battalion structure.


MEAN      MEDIAN    ST. DEVIATION   MIN       MAX
1658.4    1522      549.89          259       3597

APPENDIX C
SAS INPUT PROGRAM FOR INITIAL REGRESSIONS


-А 


999), 'THESISOUT' CLASS 
N $ YEAR BN CONT RCTR UNEMP PROP HSMMA PAYCO TOTPOP WHIPOP ; 


¡Na 
E . «ООСО .. 
<< NOM 


ROP HSMM 
SMMA PAYCO T 


RE 
OMPC QMA BN 
ae PAYC 


р 

OMPC QMA BNA 
H 

MPC QMA BNADV 
86 


RS: 
UNEM 
OP INC 


KPOP HISPOP INCOMPC QMA BNADV EIPAY ARMYMS DODMA ; 
А 
К 
0 


т-іг-і.. 


‚ВАТА = ALLYEARS; 


DATA DATA2;
INPUT BL
ID BNN;
BY YEAR;
BY YEAR;


ем HH.. 


SS SO 


APPENDIX D 
SAS INPUT PROGRAM FOR INTERMEDIATE REGRESSIONS 


=) 


CES‏ ۶ھ" 


۲390) 
EM-SY2 


S 
NESIZEZ= 80; 
Р 
М 
9 
9 
9 
9 


0438 


| 
A 
E 
N 
I 


OMY) 


FË OC 


/R CORRB COLLIN VIF ; 
1982 THEN DELETE;
1983 THEN DELETE; 
1984 THEN DELETE; 


LKPOP HISPOP INCOMPC QMA BNADV EIPAY ARMYMS DODMA ; 
1985 THEN DELETE; 


1. 
А ПАТА2: 
INPUT B 
DATA LAG83; 
MERGE OUT82 OUT83; 
DATA LAG84; 
MERGE OUT83 OUT84; 
DATA LAG85; 
MERGE OUT84 OUT85; 


DA 


5 


PROC 


PROC 


PROC 


PROC 
ВТ TE 


PROC 


BY 
PROC 


BY 
PROC 


BY 
PROC 


BY 
PROC 


BY 
PROC 


PLOT 
DATA= LAG83 
PLOT 52568۱ / VREF=0 HREF=0; 


PLOT 
DATA= LAG 
PLOT R83* 


PLOT 

DATA= LAG85;

PLOT R84*R85='*' / VREF=0 HREF=0;
/ SLE=1 SLS=1; 


84; 
R84='*' / VREF=0 HREF=0; 


SORT DATA=OUT1;
AR; 


PLOT | 
D1*YHAT1=BNN/VREF=0; 


|! 
mc 
Cn 
кч 


D1*PROP= BNN/VREF=0; 


TD1*BNPER= BNN/VREF=0;


D1*RCTR= BNN/VREF=0;


STD1*PERCWI= BNN/VREF=0;


TD1/MIDPOINTS=-.021 TO .021 BY .0020;